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PREFACE 


With the recent developments in statistical methodology of clustering 
and classification as well as in automatic remote sensing data processing 
techniques, the scope of extracting useful information on earth resources 
from multispectral scanner data has increased considerably. Though the 
performance of most of the previously developed remote sensing data handling 
techniques has yet to be ascertained through the process of testing and 
numerical evaluations, it now appears that the development of a technology 
for performing a large area earth resources inventory is in sight. 

For the earth resources project of EOD, NASA-JSC, presently a top 
priority has been given to the development of a system for performing an 
inventory of some crop or crops of economic interest over a large area. An 
important aspect of this project would be to devise a suitable crop acreage 
estimation procedure. A major part of the research work reported here in 
our final report is concerned with this problem. 

First of all, a model is developed for the evaluation of crop acreages 
(proportions) in an agricultural area using the classification approach. The 
model takes into account the classification errors likely to arise in labeling 
remotely sensed data points under the classification algorithm used and evalu- 
ates the actual crop proportions by correcting the expected labeled crop 
proportions for the possible bias due to these errors. 

If the goal is to determine crop acreages for a large area, it will be 
necessary to use only a subset of the full data obtained using a sampling 
technique since it would not be practical to collect and process a complete 
set of scanner data covering the entire area. Depending upon whether or not 



any information is available on the area crop layouts, agricultural practices, 
etc., a suitable sampling scheme needs to be devised for acquiring a well 
representative sample of the unlabeled remotely sensed data. The precision 
of any crop acreage estimates will also depend upon the performance of the 
classification algorithm used in labeling remotely sensed data points and 
whether or not the associated classification errors are known. It is very 
unlikely to have these errors known in advance. As such, a certain amount 
of ground truth, preferably a sample of ground truth for each crop type in 
the area of interest, needs to be ascertained so that the classifier is 
properly trained and the classification errors estimated. 

Considering these aspects of the problem, we have discussed the estima- 
tion of crop acreages for different possible cases. If the classification 
errors for the classification procedure used are assumed known, the estima- 
tion method provides best estimates for the crop acreages. In case of 
unknown classification errors, the estimates are consistent. Next, both the 
error analysis and the problem of sample size are investigated in general 
as well as for certain specific cases. Our results are given in reports 1, 

4 and 5 listed in the table of contents. 

A study of classification errors for the Gaussian maximum likelihood 
classifier is made when due to interest in identifying elements of only one 
class, the other class is made of the remainder of the classes and then the 
problem is treated as a two-class classification problem. Given in report 
6, depending upon the geometry and location of the classes under this practice 
it is shown that the classification performance for elements in the class of 
main interest may or may not improve under this practice. 



The estimation of optimum errors of classification and the dependence 
of these estimates upon the distance between classes are examined fbr the 
two-class problem, assuming classes to be univariate normal, in report 2 • 
Considering both the maximum likelihood estimate and the minimum variance 
unbiased estimate for the optimum probability of misclassif ication, we give 
the relative efficiencies of these two estimates, theoretically as well as 
numerically, and investigate the bias of the former. 

A simulation study in done on the relationship between the probability 
of misclassif ication using linear as well as quadratic discriminant rules, 
the number of features and the training sample size. As shown in report 3, 
when the sample size is small, use of a fewer number of features for dis- 
crimination leads to smaller probability of misclassification. 

For classifying an observation into one of two given normal populations 
whose parameters are unknown, the usual practice in the absence of any 
training sample is to cluster past data into two nearest neighbor clusters 
and to design a sample based Bayes 1 classifier, treating the two clusters 
as training samples from the two populations. The use of £-norm is often 
advocated for such clustering. In report 7 it has been shown that such 
advocation is not always reasonable. 

A unified theory of adaptive pattern recognition has been presented in 
report 8. It has been shown that all adaptive pattern recognition algorithms 
are just different means of approximating a function that separates the sets 
of training samples, choosing different criteria for best approximation. 

The data obtained by remote sensing devices in Earth Resources Survey 
come from a large area and often over a large period of time. Due to changes 



in spatial and temporal conditions the statistical characteristics of the 
data have been found to undergo changes. In such a situation, the performance 
of sample-based Bayes classifier designed on the assumption of normality of 
the populations can be enhanced by periodic updating of the parameter esti- 
mates. It has been shown in report 9 that all updating algorithms that can 
be found in literature are related to one particular model of the random 


environment . 
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ESTIMATION OF A LARGE AREA CROP ACREAGE 
INVENTORY USING REMOTE SENSING TECHNOLOGY 

0. SUMMARY 

Based upon the existing remote sensing capabilities, the useful informa- 
tion about the acreage of some crop of economic interest can be obtained from 
multispectral scanner measurements acquired over an agricultural area. If 
the goal is to determine the acreages covered by various crops over some large 
-area such as the continental United States, then some sampling procedure will 
be necessary since it would not be practical to collect and process a set of 
scanner data covering the entire area. 

In this report we develop a model for the evaluation of acreages (propor- 
tions) of different crop-types over a geographical area using a classification 
approach and give methods for estimating the crop acreages. If prior informa- 
tion is available on the classification errors associated with the classifica- 
tion algorithm used, the estimation method provides the best estimate for the 
crop acreages. Otherwise, the method would first require a certain amount of 
ground truth in the area of interest to be obtained so that the classifier 
can be trained and the classification errors estimated. 

If the main interest lies in estimating the acreages of a specific crop- 
type such as wheat, it is suggested to treat the problem as a two-crop problem: 
wheat vs. non-wheat, since this simplifies the estimation problem considerably. 
The error analysis and the sample size problem Is investigated for the two- 
crop approach. Certain numerical results for sample sizes are given for a 
JSC-ERTS-1 data example on wheat identification performance in Hill County, 
Montana and Burke County, North Dakota. Lastly, for a large area crop acreages 
Inventory we suggest a sampling scheme for acquiring sample data and discuss 
the problem of crop acreage estimation and the' error analysis. 



1. INTRODUCTION 


In recent years the development of several automatic data processing 
techniques for statistical pattern recognition has enhanced considerably 
the scope of remote sensing technology for the study of earth resources, 
particularly in the field of agriculture. It now appears that a system 
for performing a large area crop inventory can be developed on the basis 
of existing remote sensing capabilities. 

The data handling and analysis for remotely sensed agricultural resources 
over a large area may not be feasible both from technical and economical view- 
points if each scanned data point is being processed for its recognition. For 
example, if a complete recognition is desired for an ERTS scene, it would re- 
quire processing over half a million data points. As such, an important 
requirement for any system to be developed for a large area crop inventory 
should be to have a suitable crop acreage estimation technique that uses 
only a sample of the unlabeled remotely sensed data obtained for the area 
of interest for the purpose of recognition. 

In this report we discuss a large area crop acreage estimation procedure 
that would meet this requirement for the system. We develop a model for the 
evaluation of crop proportions for an agricultural area and provide methods 
for crop acreage est ima tion , taking into consideration the classification 
errors likely to arise in labeling remotely sensed data. The error analysis 
for the model is studied and expressions for variances of different estimates 
are given, in general as well as in specific cases. For the two-crop situa- 
tion, the problem of sample size is investigated and certain numerical results 
for the sample size are provided. Next, we extend the scope of our study to 
investigate a large area crop inventory. 



2. CROP PROPORTIONS MODEL 


Suppose there are m different crops in the agricultural 

area of interest and that every data point is identifiable with respect to 
one of these crops. Let p^ denote the proportion of pixels in i=l,2,...,m. 
Considering a random sample of n unlabeled remotely sensed data points, let 
n^ be the number of points classified into i=l, 2, . . . ,m,using a classifi- 
cation algorithm. Suppose n(ijj) is the number of data points to be in 

but classified into n . , then 

l 

n^ = n(i|l) + n(i | 2) + + n(i|m) 


and 



i=l,2, . . . ,m 


(2 


are .the , observed crop .proportions for the. sample data under the classification 
algorithm used. The observed proportion n^/n is a biased estimate of p^ since 
it estimates unbiasedly E[nVn] given by 




m 


- J Pj PUlJ) 

j=l 

where P(i|j) denotes the probability of classifying a data point from ti\ 
into ti\ tinder the classification algorithm. It may be pointed out that 
processing of remotely sensed data for total recognition would lead to an 
evaluation of the expected classified crop proportions instead of the 


(2 
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actual crop proportions p^s* Of course, if the classification algorithm 
performs so well that the classifiction errors are sufficiently small, 
will be close enough to p^, i=l,2 , . • . ,m. But most statistical pattern 

recognition techniques for processing of remotely sensed data are expected 

■ / 

to be fallible and thereby the two types of proportions are not going to be 
near equal. Henceforth in our discussion we will assume that P(i|j) > 0 
for at least one j different from i, 

A 

Denoting the observed proportion n^/n by e^,i=l, 2, . , . ,m, it follows 
from (2.2) that 

A 

e = E[e] 
or 


where 


and 


e = Pp 


1 

rc 
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p(i|D 

P(l|2) . . 

. . P(l|ra) 

P = 

P(2 J 1) 

P(2 1 2) . . 

. . P(2|m) 




P(m|l) 

P(m|2) 

P(m jtn) 


v 


Accordingly, the vector of actual crop proportions 

p = P -1 e 


m 




1 provided e and P are known. 


i=l 


(2.3) 


(2.4) 


are obtained subject to 
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The vector of classified crop proportions, e, can only be known if the 
complete set of remotely sensed data is processed for total recognition using 
a classification algorithm* Hence, in general, e will be unknown. Regarding 
the classification error matrix P, two cases arise: 

(i) P is known 
(ii) V is unknown . 

If P is known, an unbiased estimate of p is 

p = P _1 e . (2.5) 

But P will generally be unknown. As such it would be necessary to obtain a 
certain amount of suitably selected gound truth in the area of interest, 
probably independent of the sample data used in estimating e, so that the 
classifier is trained and the classification error matrix estimated. Let 

A 

P be an estimate of P. Then an estimate of p when P is unknown is given by 

A 

i = P _1 e . (2.6) 

A 

A 

Clearly, p will generally be a biased estimate. Both bias and mean 
/\ 

A 

square error of p will depend upon the performance of the classification 
algorithm as well as the degree to which the sample represents the popula- 
tion. The classification performance can be achieved desiredly by training 
the classifier sufficiently on the basis of a well representative sample for 
the ground truth. By adopting certain sampling schemes that may be suitable 
for the area of interest, appropriate samples for both the ground truth and 
the unlabeled remotely sensed data can be acquired. For our later discus- 
sion we assume that these two types of samples are independently obtained. 

In appendix 1 we have derived general expressions for the covariance 

A. ^ 

matrix of p>and both the bias and the mean square error matrix of p. In the 
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case of p not only the estimate p itself but bias as well as mean square 
error quantities will also depend upon how P is obtained* One solution for P 
and the probability distribution of its components is suggested therein, as well. 

3. TWO-CROP APPROACH 

Sometimes the main interest is in estimating the acreage of a specific 
crop type in the area of interest* In that case one approach to the acreage 
estimation problem lies in considering to be the specific crop type and 
7 Tq to be the ,f other crop' 1 consisting of the remainder of the crops, and then 
treating it as a two-crop situation. However, lumping of different crops 
together for the IT other crop" would require certain caution and should be 
judged in terms of the classification performance for the two-crop case as 
against that for the case of the original set of crops. For the Gaussian 
maximum likelihood classifier, Basu and Odell (1973) have investigated this 
problem and have shown that the classification performance for the class of 
main interest may or may not improve when the classification is performed 
using the two-class approach. But the problem under this approach is 
greatly simplified and, barring extreme cases, perhaps it will provide 
satisfactory solutions in the remote sensing situation when interest only 
lies in ascertaining the acreage cover of one specific crop. 

Now considering two crops tt^ and let P(l[o) - and P(o|l) - $ 
for the probabilities of misclassification when a certain classifier is 
used. We will assume that ^ !• Then 
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If p^ and are the actual crop proportions of ir^ and ir^, respectively, 
whereas e^ and e^ are their respective expected classified crop proportions 
under the classifier used, it follows from (2.3) that 

e^ = (l-$ 2 ) ^ (1-p-j^) (3.1) 

and 


e 2 “ 


On the other hand. 


Pi " 


e l ~ *1 


1 l-$ -* 2 


(3.2) 


and 


Po " 


e 2 “ *2 


2 l-* 1 -* 2 


or p 2 = l'-Pj^. 


Suppose from a random sample of n unlabeled remotely sensed data points. 


n^ points were classified into and n 2 = n-n^ points were classified into 
Hq by the classifier. Then 

n l 

e i = y • 

- e x - 


P 1 l-$ 1 -« 2 


(3.3) 

(3.4) 


if and are known, and 


e i - b 

A A 

1-* -$ 
V 1 *2 


(3.5) 


when and are unknown and have estimates and respectively. Clearly 


Var(p ) = x Var(e n ) . 

a-» x -v 2 

A 

/S 

For the estimate p^ in (3.5), it easily follows that 


(3.6) 
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Bias ( P;L ) = (e 1 ”$ 1 > E[T-G] - EfT^-^) ] (3.7) 

where 

T = (l-*^) -1 

and 

e = ( i -^ 1 -$ 2 )" 1 . 

However, evalution of expected values in (3.7) may be quite a difficult job 
and so an exact bias value may not be accessible. An evaluation of the 

A 

A 

MSE(p^) in its exact form is even more difficult. As such we instead con- 
sider having it in the following approximate form obtained in Appendix 1 
using the 5 -method . For a discussion on the method, see Rao (1965). 

1 ' e-$2 e - $ 2 * 

MSE(p )= j (Var(e )+ [1- * _ * ] Var ($)+[ ! _/ l Var(* )) (3.8) 

[l-* x -* ] 12 12 

or 

A 

MSE(p^) = j [VarCep + U-pp^arCip + pj Var(i ) ] (3.9) 

ti- *r*2 r 

where p^ is given by (3.2). 

Sample Size 

Considering simple random sampling with pixel as the sampling unit, we 
discuss the problem of sample size necessary to minimize the sampling cost 
or to achieve a desired amount of precision for the proportion estimate, given 
that the other is specified. Suppose total sample consists of N^n + N^ + N^ 
data points selected randomly, where n is the size of sample of unlabeled 
remotely sensed data used for estimating e^, and and are sample sizes 
for ground truth data from tt^ and and are used to estimate and 
respectively. The estimates e^, and ^ are obtained as observed 
sample proportions and thus it follows from (3.6) and (3.9) that 
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- e (1 - e..) 

Var(p ) = -± ±- 

n(l-^ 2 ) 

and 


MSE( Pl ) = 




~ +a - p i > nI + Pi— 5— 


( 3 . 10 ) 


Suppose we want to obtain sample sizes necessary to minimize the sampling 

A 

A A 

cost when Var(p^) and MSE(p^) are specified, say each equal to or smaller 

2 

than a . In the case of known, the only cost involved is that of 

processing the remotely sensed sample data* Clearly, it will be minimum 
when the sample size n is the smallest integer greater than or equal to 


e i a - e i ) 

:(i^* 2 -t 2 ) 2 c : 


(3.11) 


For when and $> 2 are unknown, there are two types of cost involved: one 

is the cost of processing the total sample data, say at the rate of c dollars 

per data point and the other is the cost of obtaining ground truth, say at 

the rate of c^ dollars per data point. Then the cost associated with a 

sample of size N = n + + N 2 is of the form: 

C(N) = c^n + (Cj+c 2 ) ( W . (3.12 

The purpose is to find N (i.e. , n, and N 2 ) which minimize C(N) subject to 
A 2 

MSE(p x ) <_ a . This is done in Appendix 2 where we derive explicit expressions 
for n, and N 2 in (A. 9). 
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4. AN EXAMPLE 

Certain sites in Hill County, Montana and Burke County, North Dakota 
were selected to investigate wheat identification performance for the ERTS-1 
satellite data during 1973. For the sites in Hill County, there were three 
acquisition periods, covering both winter and spring wheat seasons, for 
which ERTS-1 labeled data were evaluated against the ground truth to ascer- 
tain wheat identification performance. In the case of the site in Burke 
County, there were only two acquisition periods covering the spring wheat 
season. For the classification identification performance results and other 
details, refer to Appendix 3. 

Considering to be the omission percentage for the non-wheat data 
points and $ 2 for the wheat data points, we give sample size results in 
Figure 1-7 for the various cases of omission percentages listed in Appendix 
3, assuming different wheat proportions in the area and = .01. Based on 
these results, the following conclusions are drawn: 

1. Expected labeled wheat proportion, e^, increases as the actual 
proportion of wheat, p^, increases for the area, though not strictly. 
Though to a certain extent it depends upon the magnitude of the 
omission percentages for both non-wheat and wheat data points, 

it tends to centralize away from too low or too high values for 
the percentage. 

2. Sample size for the unlabeled remotely sensed data first increases 
as the actual wheat proportion increases and then decreases later 
on; the point of decrease depends upon the size of the two omis- 
sion percentages. 
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3. All sample sizes increase as the total omission rate 4 + 4^ 

increases . 

4. Sample size, for the unlabeled remotely sensed data is much 
larger when 4^, $ are unknown compared to when these are known, 

5. In the case of 4^, 4 2 unknown, the sample size for the unlabeled 
remotely sensed data is proportional to c^/c^ the ratio of two 
types of cost* 

6. Sample sizes for ground truth of wheat and non-wheat are inversely 
proportioned to c^/c^* 

7* Sample size for the ground truth of wheat is larger than that for 
non-wheat when the expected labeled wheat proportion is below ,5. 
Reverse is the case when such proportion is above *5, A similar 
trend holds against the actual wheat proportion, though not 
strictly . 

8* Sample size for the ground truth increases as either of the two 
omission percentages increases when the other is held fixed. 

For making a comparison of sample sizes irrespective of the wheat propor- 
tion which, in fact, is unknown, a suitable criterian is to determine the sample 
sizes against values for the coefficient of variation, C.V. =a/p. Generally 
the wheat coverage in any area of interest is expected to be somewhere in 
between 1 percent and 20 percent. As such we here give sample sizes for the 
unlabeled remotely sensed data and the ground truth of wheat as well as non- 
wheat by specifying a = .01 and considering certain C.V. values in a 5 to 50 
percent range. Numerical results are presented in Table 1 for all different 
cases of 4^, values that arise from the wheat identification performance 



results given in Appendix 3. Moreover, for certain cases the sample sizes 
are sketched in Figure 8-14. The following conclusions are drawn: 

1. All samples sizes increase as the total omission percentage 

+ $2 increases. 

2. Except for the sample size for the ground truth of wheat, sample 
sizes decrease as the coefficient of variation increases. These 
are generally very high in numbers for the 5 percent co-efficient 
of variation but levels off when the co-efficient of variation is 
50 percent. 

3. Sample size for the unlabeled remotely sensed data increases con- 
siderably if $2 are unknown compared to their known case. 

4. Again, all sample sizes depend upon the ratio ^/C], as regards 
the two types of cost. 

5. Sample size for the ground truth of wheat is consistently larger 
than that of non-wheat. Also, it shows very small changes over 
the range of co-efficients of variations being considered here. 

In cases where there is a high overall omission percentage, and 
particularly for the non-wheat, it tends to increase as the co- 


efficient of variation increases. 
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TABLE 1: Sample sizes: n for the unlabeled remotely sensed data, for the ground truth of 

wheat, and N for the ground truth of non-wheat when c = .01 
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5. A LARGE AREA CROP ACREAGE ESTIMATION 

Our previous discussion, in essence, applies to crop acreage inventory 
for an agricultural area which is homogeneous in respect to agricultural 
practices and thus is not expected to be large enough. Since a major objective 
of the JSC-EOD project is to perform or estimate crop acreages for a large 
area using available remote sensing capabilities, we here suggest a sampling 
procedure to procure sample data for the purpose of estimating a large area 
crop acreage inventory and discuss the error analysis associated with it. 

Once again, we assume that the frame is made of agricultural areas; the 
non-agricultural areas in the region of interest can be easily excluded by 
way of a monitoring system. As a first step in the sampling procedure, we 
suggest having a geographical-based stratification which effects a division 
of the region into reasonably homogeneous areas with respect to physical and 
climatological conditions. Considering additional factors of (i) the pre- 
dominance of various crop- types and (ii) the latitude and longitude, a 
finer stratification must be achieved. This is to obtain better discrimina- 
tion for the underlying crop-types and to control variability which may 
otherwise dominate over the distinction that exists between the resolution 
classes for these crop-types. 

Note that as a result of stratification one may only need to consider 
a part of the region for frame if crops of interest do not cover the whole 
region. So depending upon whether the frame would require consideration of 
the complete region or only a part of it, one should make a list of strata 
making up the frame for the purpose of sampling. 

Remoting sensing data gathered by an ERTS satellite is documented in 
terms of scenes, each covering approximately an area of 100 x 100 miles and 
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divided into four strips where each strip has approximately 6,400 scanlines 
in it* As such, we suggest a three stage sampling plan to be independently 
carried out in each stratum: select randomly ERTS scenes at the first 
stage, strips within scenes at the second stage and scanlines within strips 
at the third stage. Of course, one may consider one more stage in selecting 
pixels within scanlines. However, sampling at this stage is excluded from 
the plan because it is inconvenient and uneconomical. 

Notations 


Let R be the region (in the sense of frame) of interest for estimating 
crop acreages. Suppose it is stratified into strata R t , t=l,2,...,L, with 
weights w^, the proportion of pixels in tth stratum, t“l, 2, ... ,L so that 


R = UR. 


with 


I w t = 


In stratum R^, let 1^ be the number of scenes whereas J, H and n denote the 
number of strips per scene, number of scanlines per strip and number of 
pixels per scanlines, respectively. From the previous paragraph it is 
obvious that there is no need to distinguish between strata in the categories 
of strips per scene, scanlines per strip and pixels per scanline. Next, let 
e tijh^ 1T k^ t ^ ie ex P ecte d proportion of pixels to be classified in tt^ from 
the hth scanline in jth strip of ith scene for stratum R^_, t=l,2,...,L. 

Then for R^, 

H 

e tij^k^ = 2 e tijh ( V 

h=l 
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the expected proportion of pixels to be classified in from the jth strip 


in ith scene. 


H 


5 ti (ir lP ^ 2 e tiih^ r k ) ’ 


j=l h=l 

the expected proportions of pixels to be classified in tt, from the ith scene, 

k 


H 


e tyh ( V > 


vv - £ I I 

i=l j=l h=l 

the expected proportion of pixels to be classified in n^. Accordingly, 


,( V “ 2 u t e t ( "k> • 


( 5 . 1 ) 


t^l 

is the expected proportion of pixels to be classified in k-l, 2, . . . ,m 5 
for the region R. 

Estimates 

Suppose m^, r and s denote the corresponding number of scenes, number 

of strips per scenes and number of scanlines per strip that one selected for 

R t , t=l, 2, . . . ,L, using the stratified three stage random sampling described 

earlier. Let n . (tt, ) be the number of pixels classified into tt from the hth 
tijh k K 

selected scanline in jth selected strip of the ith selected scene in R^. 

Then considering the observed proportions of classified data points into 
different crops for estimates, one has 

- , , 

e tijh ( V - — ; — - 


■tij 


<V S n tiih ( V> 


h=l 
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r s 


'ti^k^ nsr ^ ^ n tijh^ 1 I k^ * 


j = 1 h=l 


m r s 

!(».) - -i- y y y n . . 

t k nsrm ^ tij 

i=l j=l h=l 


tijh^k^ ’ 


and L 

a (ir ) — ^ (n^) 9 k 1 9 2 9 . • • ,tD * 

t=l 


Next, expressions for Var (e^i^) ) and Cov(e t (ir^) , ^(^Kk^ ') can be 
obtained without much difficulty. For example, refer to Section 10.8 in 
Cochran (1963) for the general discussion on three stage sampling plan. 

A. 

Hence, the covariance matrix of e is given by 


L 

Var(e(ir k )) = ^ w^ Var(e t (ir k )) 
t=l 


(5.2) 


and 

Li 

A A ^ A A 

Cov(e(ir k ), e(-rr k ,)) = £ w t Cov(e fc (ti^) , e fc ) ) , k^k', k=l,2,...,m . (5.3) 

t=l 

Similarly, an estimate of the covariance matrix is obtained by replacing 
the unknown quantities by their estimates in (5 . 3) . In this context , see 
Chhikara and Odell (1974) who have discussed such results in greater details. 

Now to obtain the actual crop proportions, there is a need to consider 
whether or not the classification error matrix is the same for each stratum. 

When the area is wide and large and the stratification is performed consider- 
ing factors mentioned in the beginning of this section, it is quite likely that 
these classification error matrices will not be the same for different strata. 



18 


In that case, find an estimate of p^, k=l,2, . . . ,m, using (2.5) if the classi- 
fication error matrix is known and (2.6) if it is unknown for each stratum. 

Denoting p. by p (it.) for stratum R , it then follows that 
k t k t 

L 

A \ 1 A 

P k = 2i, w t p t^\^ » k = 1,2 m (5.4) 

t=l 

when the classification matrices, say P^, t = 1,2,..., L, are known, and 

* * 

A \ A 

P k = 1 w t p t^\^ » k = 1,2, ... ,m (5.5) 

when these are unknown and are separately estimated using ground truth data 

A 

A A 

from each stratum. Next, Var(p fc ) and MSE(p k ) are respectively obtained from 
(A. 2) and (A. 7) after making an appropriate substitution from (5.2). 

On the other hand, either there is the same classification error matrix 
for all strata or can be made so by proper adjustment of signatures in the 
classification algorithm for each stratum. For then an estimate of crop 
proportions p k> k=l,2,...,m is directly given by (2.5) if the common classi- 

■A 

fication error matrix is known and by (2.6) if it is unknown, using e(7T^_), 
of (5.1) for e. Hence, both estimates and their error analyses 
are obtained by following the general procedure given in Section 2. 

In fact, our approach in Section 2 is quite general and can be applied 
to perform any large area acreage inventory by considering an appropriate 
sampling scheme for both the unlabeled remotely sensed data and the ground 
truthed data. 

Once again if interest lies in estimating only the wheat acreage, the 
two-crop approach of Section 3 can be applied. Then an estimate of wheat 
proportion is obtained from (3,4) or (3.5) as the case may be, either first 



obtaining it stratumwise and then combining as we did above in (5.4) and 
(5.5) or directly, depending upon whether or not the classification error 
matrix is the same for different strata. Subsequently, the precision of 
this estimate and the sample size necessary to achieve a desired precision 
with minimum cost can be easily obtained by applying our technique of 
Section 3. 

Sample Size 

Taking the cost factor into consideration, suppose we want to determine 
the sample size that either minimizes the sampling cost for a specified 
precision or maximizes the precision of the estimate for a fixed cost. 

Though a large initial cost is involved in acquiring remotely sensed data, 
presently we are mainly concerned with the cost of the processing and 
labeling of the sampled data. In general, any such cost can be considered 
as 


C = c.in + c 0 m r + c_m rs 
t 1 - 1 / t 3 t 


for the sample in stratum R , and 


C = (c^ + c 2 r + c 3 rs) ^ m t 


t=l 


for the area of interest. 

In case of unknown classification error matrix or matrices, there is an 
additional cost of sampling the ground truth, say C'. As such the total cost 
involved is C = C + C'. Now if the cost is fixed, say C" C^, a determina- 
tion of sample sizes for both the unlabeled remotely sensed data in all 
three categories and the ground truth for various crops can be achieved by 
solving equations obtained by equating the partial derivatives of 
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MSE(p k ) + A(C"-C 0 ) , k = 1,2,..., m 
where A is a Lagrange multiplier, with respect to i , r, s and the ground 

t- M 

- 2 

truth sample sizes to zero. Similarly, when the MSE (p ) is fixed, say a , 

K K. 

k=l, 2 , . • . ,m, again this can be achieved by considering the function 

C" + A k [MSE(p k ) - a k ], k = 1,2,..., m 
for minimization. This, of course, would lead to k different values for 
various sample sizes unless we consider the minimization from the point of 
a specific crop-type proportion estimate. On the other hand, a unique 
determination can be obtained by considering the largest value obtained 
in each case. 

It may be pointed out that under this procedure, it will be difficult 
to give any closed form expression for any sample size and its carrying out 
would involve some optimization technique. 

If the classification error matrix (or matrices) is known, the sample 
sizes m^Ct-1,2, . . . ,L) , r and s can be easily determined by minimizing 

A A A 

VarCe(ir k )) + A(C-C Q ) or C + IjJVar (e(ir k ) ) - o^\ as the case may be. More- 
over, the sample size problem in the case of unknown classification error 
matrix or matrices can be treated either by assuming the classification 
errors known or by investigating the two types of sampling separately. 



6 . FURTHER REMARKS 


In actual practice it may not be possible to have every data point 
identified with one of the crops in the area of interest, particularly if 
the area is large. This may be caused by not knowing all crop- types that 
exist in the area or some data points representing pixels falling on the 
field boundaries. As such the model developed in this report may be viewed 
somewhat restricted. Its use for performing a large area crop inventory 
may be considered subsequent to obtaining information about the agricultural 
practices in the area. 

It is extremely difficult to model the problem of a large area crop 
inventory in its full generality unless certain constraints are imposed. 

The condition of identif lability is one such constraint that one must have 
in order to deal with the problem analytically . 
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APPENDIX 1 

A 

A.l. Variances of Components of p 

/s 

For p given in (2.5), the covariance matrix. 


a a rn A A rp _ rp 

E[(p - p)(p - p) 1 ] = P 1 E[(e - e) (e - e) ] (P ) 


or 


2 P = (p b 


-I T 

V (P V 


A ^ 

where V denotes the covariance matrix of e. Denoting the (i, j) 

-1 ii A 

element of P by p , it follows that the variance of p_^, the ith 


element of p, is given by 

m 


m m 


Var (p ± ) =^(P ±j ) 2 Var (e^) + £ ^ P lj P lk Cov (e.. , e fe ) 


j-1 


i=l k=l 
j#c ' 


A A 


where V(e.) and Cov (e., e,) would depend upon the sampling scheme 
J J 

used for obtaining samples of unlabeled remotely sensed data points. 

In the case of random sampling with sampling unit as pixel (i.e. one 
data point) , 

a e. (1 - e . ) 

Var (e.) = J— 

3 n 


and 


e.e. 


A A i 1c 

Cov (e^. , e k ) = - — , j t k, j , k = 1, 2, . . , m, 


ignoring the finite population correction due to large population size. Next 


(A.l) 


(A. 2) 


(A. 3) 
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an unbiased estimate of these quantities is given by 

e. (1 - e.) 


/N A 1 

Var (e . ) = — “■ . 

v j n - 1 




A A 


Cov (e^. , ej 


e * e u 

= - n J - _ j ^ k> j, fc = 1, 2, . 


m. 


On the other hand if the sampling unit is a 5 x 6 mile segment con- 

sisting of r pixels then considering a random sample of m segments 

(here for the sample size one may consider n - mr data points) from 

the total of M segments in the area of interest and again ignoring the 

finite population correction, one gets (Cochran, 1963) 

M 

k 2 


and 


Var (e.) = —ttt 
J m(M 


■i-jr J 

- 1) 31 3 

i=l 


r 


M 


(A. 4) 


Cov 


( j = \ ■ Y (e " e -j 


m (M " i) i=l 


)(e ki - e k>- d * k 


J . k » 1, 2, . . , m 


where e. . denotes the proportion of classified data points in ir. for 
J ^ 3 

the ith segment. Once again, for their unbiased estimates 

m 


/\ 

Var (e.) = 

3 


m(m - 1) 


\ ' A A 

/ (e. . - e. 


)‘ 


i=l 


and 
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/\ * a 
C ov (e J} e fc ) 


m 


m(m - 1) 


r * a a 

> (e . . - e. 


)(e ki - \ ) > 1 * k * 


i=l 


J > ^ ^-5 ^ 9 • • » ) 


Similarly, components of V and their estimates can be obtained for other 
types of sampling plans* Making appropriate substitution in (A.l) or 

A. 

(A. 2), variances for the components of p and their estimates are then 
obtained. 

A 

A. 2. Mean Square Errors of Components of p . 

A 

A 

First we calculate the bias of p given by 

A A 

A A 

Bias (p) = E [p-p] 

-I 

= E [p e - P e] 

1 A A -1 -1 
= E [P (e-e) + (P - P )e] 

= E [P^ 1 - P _1 ]e (A. 5) 

A A 1 

because the first term is zero due E(e-e) - 0 for a given P . Clearly, 
the bias depends upon how much bias there is in P \ and 

A A « 

Bias (p) = (Bias (P ))e. 

A 

A 

In order to find the mean square error of any component of p, let us 
first consider the evaluation of matrix, 

A A 

A A rp A -I A 1 A-1A IT 

. E [(p-p) (p-p) 1 ] = E [(P e - P e)(P e - P e) 1 ] 



25 


so 


a 1 a /v m a •» m <a. *1 -I rp “t 1 ryi 

E [(P )(e-e)(e-e) (P 1 ) i + (P 1 - P l ) ee (P _i - P ) ] 


E [(P 1 ) V (P _1 ) T ] + E [(P 1 - P 1 ) ee T (P 1 - P 1 ) T J 


(A. 6) 


where E stands for expectation with respect to P, Again, denoting the 

"“1 ii a _ a , , 

(i, j)th element of P by P J and that of p by P 1 ^ , it follows from 

A |u^ A 

(A. 6) that the mean square error of the i component of p, is given 


by 


MSE (p ± ) - E 


. r m 


m 


nr m 


(P 1 ^) 2 Var (e 


L j=l 


m . m 


j )+ I! ^ ik c ° v 

j=l k=l • • 


J e^ E [P 1 ^ - P lj ] 2 +^ Je^e^tCP^-P^JCP^-p^)] 


j=l 


j=l k=l 

j^k 


(A. 7) 


i ^ 1 j 2 , > • » y m . 


Once again, V, i.e, Var (e.) and Cov (e., e*) , j and k - 1, 2, m, 

J J ^ 

may be obtained as in (A. 3) and (A. 4). If some other sampling plan is used 

■A 

for selecting remotely sensed data to obtain the estimates e j' G > ex pres- 
sion for V can accordingly be obtained. To evaluate expectation in (A, 7), 

A 

one needs to find the distribution of P. This will, of course, depend 

A 

upon how P is obtained. In general, it will be difficult to obtain any 

A 

exact distribution of P. However, if the sampling of ground truth involves 

A 

separate independent samples from each crop and P is obtained as the 
matrix of observed proportions among randomly selected pixels classified 



into different crops using a classifier, each column vector of P has a 
multinomial distribution and is stochaotically independent of the others 
in P. Since expectation in (A. 7) is for elements of P , it may not be 

A 

easy to derive the MSE in a closed form, especially if the number of 

crops is large. 
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APPENDIX 2 


Two-Crop Case 

A 

.First we derive the MSE(p^) as in. (3.8). 

Proof of (3.8) 

A A A 

Considering the estimates $ 2 and e, being obtained from 

independent sets of samples, it follows by an application of the 6-method 


that 


Hence 



- 


1 1 


Var(? 2 ) 


A 

MSE(p i ) ‘ a-^-t 2 ) 1 2 ( Var(e i> + E ' 


V*1 

l-$ -® 

1 2 


“l2 


VarC^) 


L 1_t rhJ 


Var(o) 


)• 


Here dot with equality sign means equality with approximation. This 


establishes (3.8). 


For a determination of sample size necessary to minimize the cost 

2 

subject to MSE (p^) <_ a as discussed in section 3, it is achieved by 


Var(4> 2 ) 


minimizing the function 
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F = C (N) + X (MSE (p x ) - o' ) 


with respect to n, and N^, where C (N) is given in (3.12) and MSE(p^) 
is given in (3:10).. By rewriting, we have 


F = c x n + ( Cl + c 2 ) (N 1 + N 2 ) + X (1 - - $ 2 ) 


-2 


e 


.r e l ( 1 " e l ) + (l-p ^ 2 $ 1 + V 2 ^2 (1 ~V - o 2 (1-^ 1 -^ 2 )' 

n N 1 1 N 2 


Taking partial derivatives of F with respect to n, and N 2 and equating 
each to zero, one obtains the following set of equations. 


c 1 - X(l-4> 1 -$ 2 ) 


-2 


= 0 


n 


-2 ,, _ -.2 *, ( 1-0 


( Cl + c 2 ) - (i- Pl r "i ^ o = o 


1 2 


N, 


-2 2 


( Cl + c 2 ) - XCl-*^) P x (l-4 2 ) 


- 0 


N, 


Considering only the admissible solution of these equations, one has 


n “ 


Ae i (1 ” e l^ 
C JL 


N- = (1 


N 0 = P 


V 

V x Ol-O 

— 1 

( C 1 + C 2^ 1_ ®1 ~ 

■ i 


X _ V^V 

(c L +c 2 ) (l-$ 1 -<f 2 )‘ 


(A. 8) 
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1 z 

Considering that MSE (p^) = a and making substitution in (3.10) 
for n, and N 2 obtained in (A. 8), one gets 


vT = 


a (W^) 


[V^a-ei) + (l-Pj,) V (c 1 ' ta 2 ) + P l\/ (c l 4lC 2 ) V 1 "V • 


After substituting forVTin (A. 8), the sample sizes n, and N 2 are obtained 
as following: 

n fj e l (1 ~ e l ><,|: l (l-e 1 ) + (1-pp V(c 1 +c 2 )* 1 (l-1’ 1 )+T ] v / (c 1 +c 2 )'t 2 (l-4' 2 ) I 

M c 2 <U -♦ ) J J 


Nj - y<1 ~ P l 


> Tv Ev[- 

, .2 T ... L 


0 a-« 1 -* 2 )“ ' C’pcj 


v4 1 e 1 (1-e^ + (l-p 1 ) VCc^ ^) ^(1-^) 

+P 1 V(c 1 -bc 2 )$ 2 (l-4> 2 ) ] 


*2 = 


a 2 ( 1 - 0 ^ 2 ) 2 


V .® 2 (l-$2) V (l-e^) + (l-p-jX/ 

C l +C 2 y “j 

+P X V(Ci+c 2 ) $ 2 (l-4> 2 )J . ( A. 9) 


It can be easily seen that n is a monotone increasing and , N 2 are monotone 
decreasing functions in c 2 / c^, the ratio of two types of cost. For when 
e l’ *i’ ^ 2 * are un ^- nown > estimates of n, and ^ can be obtained from (A. 9) 
by replacing these unknown quantities by their estimates. 


t 



APPENDIX 3 


ERTS-1 DATA INVESTIGATION 
FOR 

WHEAT IDENTIFICATION 


Hill County , Montana 


Complete ground for evaluation in 2 x 6 mile area in Hill 
County North 

Ground identifications of wheat, barley, oats in Hill County 
South 


ERTS-1 data evaluated at three acquisition periods covering 
spring and winter wheat seasons 


Date 


Winter Wheat Stage Spring Wheat Stage 


May 

June 

July 


Greening 

Heading 

Mature 


Pre-emergence 
100% cover 
Headed 


Classification performance results: 


W - Spring/Winter Wheat 
NW - Oats/Barley /Pasture 

Commission/ Omission Percentages 
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100 
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, June 
,t 2 ) 


May, 

(t x ,t 

July 

: 3* 

May , June 
<t r t 

, July 
2> t 3 ) 



2. Burke County, North Dakota 

. Complete ground truth for evaluation in 2 x 10 mile area 
. ERTS-1 data evaluated at two acquisition periods 

Date Spring Wheat Stage 

June 5 3"-4" growth 

June 23 Jointing 

. Classification performance results: 


W - Spring Wheat 

NW - Barley/Oats/Pasture/Summer Fallow 
Commission/Omission Percentages 



W 

NW 


w 

NW 

w 

P 5 

25] 

w 

'85 


NW 

Lio 

90J 

NW 

JO 

9o[ 


June 

5( tl ) 
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W NW 

W po 10 

NW |_5 95 

June 5j June 23 
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Figure 4 
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Figure 5 
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Figure 6 
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Case: <£>. , <J>~ unknown and c~/c, = 20 


Figure 7 
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Sample sizes: n for the unlabeled remotely sensed data, and ^ and N 2 for the ground truth 

of wheat and non-wheat, respectively, when precision for the proportion 
estimate is specified by a = .1. 
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ESTIMATION OF OPTIMUM ERRORS OF CLASSIFICATION 


FOR UNIVARIATE NORMAL POPULATIONS 
by 

Raj S. Chhi ka ra 
and 

Patrick L. Odel 1 
1 . I ntroduct i on 

For the multivariate normal populations, the problem of classification 
using the Bayes' discriminant procedure involves certain difficulties as to 
an exact evaluation of actual errors of classification, i.e. optimum proba- 
bilities of mi scl ass i f i cat ion , and an optimal estimation of these errors, 
particularly in the case of small samples. If the populations are assumed 
to be univariate normal, these errors of classification are easily expressi- 
ble in a close form. Yet even for this simplified case the problem of their 
estimation has not been fully examined. For example, to the authors' best 
knowledge no minimum variance unbiased estimates of these errors are given 
so far in the literature. 

Recently Sorum [8] investigated the estimation of both expected and 
optimum errors of classification associated with the linear discriminant 
function for univariate normal populations with known common variance. For 
the optimum errors of classification, she considered the maximum likelihood 
estimates and their various slight variants in her study involving large 
samples. Hill [6] too considered the same problem but his investigation was 
primarily motivated towards examining expected errors of classification. 
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For a brief statement of the problem, let and u be two normally 

distributed populations with means and respectively, and common 
2 

variance a . Having obtained an observation x from one of the two popula- 
tions, the problem is to identify the population to which it belongs. 
Assuming equal a prior? probabil i ties for ttj , ir 2 and a constant cost of 
mi scl ass i fy ? ng any observation, the Bayes 1 discriminant criterion amounts 
to: classify the observation x into ttj if 



Sometimes it is desirable to have the knowledge of P(i|j)'s. It provides 
an assessment of the confusion likely to arise between individuals of two 
groups and thereby it is helpful in correcting for bias, etc. For example, 
see Cochran [3] who deals with several aspects of statistical inference in 
presence of certain types of classification errors. 



In this paper we consider the problem of estimating P(i|j), i and 
j“l,2 when sample observations are available from the populations. In view 
of (1.1), one, however, only needs to consider the estimation of $(-A/2) 
when an estimation of P(i|j)'s is desired. 

Let X| y x.y , • . . , x be a random sample from population 7 r, and 

* 11 j 112 

be another random sample from population Considering the two parametric 

cases (i) ^ ,y 2 unknown and a 2 known and (ii) y p and a 2 all unknown, 
the maximum likelihood estimate (MLE) of §(-A/2) is 4>(-A/2) where A = (x-y )/a 

A 

in case (i) and A - (x-y)/s in case (ii); here x and y denote the two sample 
means and s is obtained by 


( n i +n 2~2) s - ^ (xj-x) 2 + ^ (y | -y ) 2 . 

l l 

Below in section 2 we obtain the minimum variance unbiased estimate 
(MVUE) of $ (“A/2) and then compare it with the MLE in section 3 by evaluating 
relative efficiencies of both MVUE and MLE with respect to the Rao-Cramer 
lower bound for variances of unbiased estimates of $(-A/2). It can be 
shown that the Rao-Cramer lower bound is 


1 / i , 1 V ,2, Av 

T + * <- 2 ] 


(I 


in case (i) and 


' f + _L_ + & 

^ r l n 2 2(n ] +n 2 ~ 1) 

in case (ii), where £ denotes the standard normal 
relative efficiency of an estimate is obtained by 
lower bound by the mean square error (MSE) of the 


density function. The 
dividing the appropriate 
estimate. 


0 
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2. MVUE of #(- A) 

( i) un ^ nown an< ^ a known: The random variable 




n l n 2 


where Z is the standard normal variate, is distributed normally with mean ~ 


and variance one. As a result 


*(- |) = Prob (Z/4-(~- + A-) + All < o) 
2 n l n 2 


= E [Prob(Z < - 


x - y 


°A-(— + — ) 

n l n 2 


= E[*(- 


n l n 2 


E denoting expectation with' respect to x,y, a set of complete sufficient 
statistics. Thus by the Rao-Bl ackwel 1 theorem we have the MVUE of $(- A) > 


*(- |) = $<- 


x - y 

aA-(i- + J-) 

n l n 2 


(il) Vj.Uj and ^ » a H unknown: Let U be a beta variate, ? 

2 2 2 

stochastically independent of s which is a x /v distributed with 
v = (n^+n^-Z) degrees of freedom (d.f.). Then by applying theorem 1 in 

Ellison [ 5 ] , it follows that (2U-1) AT s/a has the standard normal distribu- 
tion. Furthermore, the random variable 

1- [ ( 2 U- 1 ) s/v(ij-(I- + JL) + ( 7 - 7 )] 


has the normal distribution with mean A/2 and variance one. Accordingly, 
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4>(- 4 ) = Prob {{2U-l)s f\> [4- (— + - — ] + (x-y) <_ 0} 

P 1 n 2 

- E [Prob { (2U- 1 ) s /\) [4- (- — + - — )T + (x-y)_<0 1 x,y ,s }] 

n i n 2 

where E denotes expectation with respect to the set of complete sufficient 
statistics x, y and s. Thus the MVUE of $(" ■j) is 


$(“ 4) - Prob{(2U-l) s (— + — ) ] + (x-y) <_ 0 |x, y, s} . 

n l n 2 


= Prob (Li _< 4 “ (x-y)/2s/i[^-(^— + ) ] |x,y,s} 

n i n 2 

where U has a distribution. Since extensive incomplete-beta 

integral tables are available, (2.2) can be easily evaluated for any given 
values of x,y and s obtained from sample observations. Next, denoting 


W * 2 “ 


x-y 


2s /,(4-(JL + JL)) 


(2.2) can be rewritten In the form 

$(- A) ! 

v 2 . „,v-l v-1 


B (- 


•) 


n 1 n 2 


w 

/ (U(I-U)J 


(v-3)/2 


du < 


3. Relative Efficiencies of MVUE and MLE of $(- -j) 

First we derive formulas for the variance of MVUE, $(-A/2), and the MSE 
of MLE, $(“A/2) , in forms suitable for numerical computations, and then 

t 

present their values as well as relative efficiencies of both types of 
estimates for certain parametric values of A in each of the two cases 
(i) and (i i) . 


( 2 . 2 ) 


(2.3) 
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3.1. R.E. of the HVUE 

Case ( i ) . Noting that v= is distributed normally with mean A and variance 

(- — + - — ) and 
n 1 n 2 


we have 


*(- f) = Prob {2 £ 


n l n 2 


! v > . 


e[*(-|)] 2 = / * 2 (- 


■) 4>( J dv 


£ (i- + J- /T7T 


n 1 n 2 


n l n 2 


1 . 1 




(3-1) 


where $(x,y;p) is the bivariate standard norma! c.d.f. The result in (3-1) 
easily follows by a simple probability argument, e.g. see Zacks and Evens 
{9J. Hence 

Var [♦(- T )1 = ♦(- 2 - - 2 ; + ^ 1 ' * <’ 2> 

and 


(3.2) 


R.E. [i(- f)] = (^ + ±-) <}. 2 (- f)M Var [£(- |)] . (3.3) 

Case ( ? j ) . Letting t = (x-y) /s (~— + -pj— ) ' ^ 2 , a non-central student t variate 

' 2 j 11/2 

with v d.f. and non-centrality parameter A/ (-^ — + -pj — ) , (2.2) can be 

written as 


$(- y) = Prob { ( 2U- 1 ) 


< _ t ZI + H L + !_)] | t } . 

— n n n n ' * 


n l n 2 


Furthermore, letting and be two independent beta variables, each dis- 
tributed as B(^ 2 ^-, ~ 2 ~) and considering W = max(2U^-l J Zl^-l), it follows 


that the density function of W is 

l 


v-3 


fM - ,,,!h 


1 <w<1 


2 J B( 


2 ’ 2 
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where 


T x (a ’ b) ■ bit 


J y a ] (i-y) b ] dy . 


Now we can wr i te 


t/r+T- 

[*(- f) ] 2 - Prob (W < — — 


= F(- 


t/ — + — 
n | n 2 

+ I-)3 

1 2 


t) , (say) . 


E[*(- |)] 2 = f F(- 


t/TTX. 

n ] n 2 

+ !_)] 
1 2 


) g(t;6,v) dt 


where g(t;6,v) is a non-central student t density function with non-centrality 

parameter 6 = A/ (— +- — ) ^ 2 and v d.f. For numerical integration of the 

n l n 2 

right side in (3-4), we have considered the following forms for the functions 


F (w) and g(t ;S ,v) : 


F(w) = — — 2 1 j / y V 2 (l-y) v 2 dy 


(l-w)/2 


^ B (J^i k+ i) ( 1+w) /2 

V 1 2 * k+U f k+v-1,, \V~2_, 

Z R^rwiv y d v 


B( v- 1 , k+l ) J 
k=0 0 


where exact integration was obtained using recursive scheme, and 


g(t,6,v) = 


v^J 2 v/2 r(|) Q 


j y^ X exp [- -6) 2 -y}]dy 
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where the 32-point Gauss-Laguerre quadrature formula was used for numerical 
integrations. The final integration in (3.**) was done using the Rhomberg 
method of numerical integration. All of these evaluations were on double 
precision basis. 


After finding the variance of $(- , we now obtain 


R.E .[*(- ~)] = (7T + ~ + 
Z n ^ n ^ 



•) <t> (- y)/4 Var [$(- j) ] . 


3.2. R.E. of the MLE 

A 

For the MSE and Bias of the MLE, we need to find both E[$(-A/2)] and 

2 

E[$ (-A/2], We will be using the following result of Ellison [ 5 ] in 
our discussion. 

2 

If X is a normal variate with mean p and variance a , then 


E[<t(X)] - »( E— ) . 

XT? 


Case ( i ) : Since -(x-y)/2 a has the normal distribution with mean -a/ 2 and 

var i ance -r(— + ) , it follows from ( 3 - 6 ) that 

* n | n^ 


E[<f(- f)] - *(- 


Again, it can be easily shown that 


E[r(- f)] = *(- 


2 x^ett., 




4 Vn ] n 2 


2 /4(— + — ) 2 / 1 4(— + — ) 

Vnj n 2 ' Vnj n 2 


1 _ 1 

(,+ — f 1 ) ') 

n l +n 2 


Hence 


MSE [*(-.£) ]= 0+ - 2^(€)4>(" f) + * 2 (" |) 


' 1 2 


(3.5) 


(3.6) 


(3-7) 


(3.8) 
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and 


where 


Bias[$(- f)] = *( 5 ) - *(- f) 


(3-9) 


% = - 




- + — ) 
1 n 2 


Next, the relative efficiency of $(- — ) , 

^ A 

R.E.[$(- |)] = (i- + I~) > 2 (- f)AMSE[*(- £)] . (3.10) 

2 

Case ( i i ) . Given s , conditionally (x-y)/2s has a normal distribution with 

1 12 2 

mean (vi,-y„)/2s and variance (— + — )a As . So by applying (3.6), we have 

i ^ n 1 

A /V 

E[*(- f)J = E[E[*(- |)|s 2 ] ] 

- 4 / /i£- ♦ (J- . 


= E [$(-&/) 


(~ + ~) ) ] • 
1 n 2 


2 2 2 

Since Q. = vs /o has a x distribution with vd.f.. 


E[4>(- £)] " E [*(-*/ /?U (TTT^ 


v n. n. 


)] 


(3.11) 


where the expectation on the right side is with respect to the random variable 

0 . 

Next by drawing analogy with the previous result in (3-7), it can easily 


be shown that 

/N 

A 

2 


e [* 2 (- £)J = e[e[$ 2 (- |)| s 2 ]] 


= E[*(- 


/isa +( i_ + i_) /M +(J_ + 1_) 

v nj v 

Accord i ngl y , 

MSE[*(- |)] = E[*(n,n, ^)] - 2E[o(n)]o(- f) + * 2 (- f) 


)3. 


(3.12) 


(3.13) 
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and 


where 


Blasts- |)] = E[*(n)l - *(- f) 


n 55 


.A 


/s 


+ ( — + 
v n j 



(3-14) 


Once aga i n , 

R.E. [*(-|)] - (^-^*7 ( n| ^ 2 - l ) -)* 2 (-f )/ ‘ lHSE ^<-f )1 ' 

Expressions in (3-0, (3-7) are in terms of the standard bivariate 
normal c.d.f. and can be easily evaluated using the method by Owen (1956), 
among others, for any given values of n j , n 2 anc * A. Next, for the final 
numerical integration in (3-11) and (3-12), we employed the Gauss-Laguarre 
quadrature formula. 


(3.15) 


k . Numerical Results 

Certain numerical results are presented in tables 1 and 2 considering 

n l= n 2 =n and specifying values for n and A: n=5, 10, 15,30 and A = .5, 1.0, 1.5, 

2 

2. 0,2. 5, 3-0 in table 1 for a known case and n=5,10 and A=. 5, 1 -0 , 1 . 5,2. 0 , 

2 

2. 5, 3-0, 3- 5 in table 2 for a unknown case. (Due to limited computational 
facilities, we considered only two values for n in the latter case). The 
presented results are mainly designed to- exemplify the comparison of the 
MVUE and the MLE in small sample case. 


2 

For the case of known a (figure 1), bias of the MLE is non-negative for 

2 

all values of A, maximizing at A = 2.0. But when o is unknown (figure 2), 
bias is negative for A equal or less than 2.0 and positive for A greater 
than 2.0. However, in both cases, as one would expect, it decreases uni- 
formly as the sample size n increases and is zero when A=0. 
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Further, interestingly enough, we observe that whether or not a is 
known makes a great difference in relative efficiences of the two types of 
estimates- The relative efficiency of MLE is higher than that of MVUE for 

smaller values of A (i.e. when probability of misclass if ication is higher) 

2 2 
when o is known (Table 1). But reverse is the case when a is unknown 


(Table 2). 



TABLE 1 


A 2 

Relative Efficiencies of MVUE and MLE of ${- — ) for o Known Case 


A 

n 

Var [J(- £)] 

MSE[$(- |) 

• 5 

5 

.01502 

.01371* 


10 

.0071*9 

.00716 


15 

.00499 

.00484 


30 

.00249 

.00245 

1.0 

5 

.01256 

.01172 


10 

.00624 

.00603 


15 

.00415 

.00406 


30 

.00207 

.00205 

1.5 

5 

.00933 

.00899 


10 

.00460 

.00452 


15 

.00305 

.00302 


30 

.00152 

.00151 

2.0 

5 

.00615 

.00620 


10 

.00300 

.00302 


15 

.00198 

.00200 


30 

.00098 

.00099 

2.5 

5 

.00360 

.00384 


10 

.00173 

.00180 


15 

.00114 

.00117 


30 

.00056 

.00057 

3.0 

5 

.00186 

.00212 


10 

.00088 

•Q0095 


15 

.00058 

.00061 


30 

.00028 

.00029 


] R.E.[*(-|)] R.E.[$(- 


■ 9955 

l .0881 

.9982 

l .0044 

.9989 

1 .0297 

• 9995 

1.0149 

.9867 

1.0573 

.99361 

1,0285 

.9958 

1.0189 

.9980 

1.0095 

.9722 

1 .0082 

.9861 

1.0026 

.9907 

1 .0014 

.9954 

1.0005 

-9523 

.9438 

.9756 

.9677 

.9836 

. 9774 

• 9917 

.9881 

.9274 

.8685 

.9623 

.9249 

.9745 

.9475 

.9871 

.9725 

.9014 

.7907 

.9500 

CO 

CT\ 

00 

.9653 

.9144 

.9816 

.9538 


N>| t> > 
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’3 

TABLE 2 

Relative Efficiencies of MVUE and MLE of $(- for Unknown Case 


A 

n 

Var[*(- §■)] 

MSE[*(- |)3 

R.E. [ l (- |)] 

R.E. [$(- f: 

• 5 

5 

.01639 

.01726 

.9438 

.8965 


10 

.00780 

.00814 

.9900 

.9485 

1.0 

5 

.01491 

.01494 

.9467 

.9448 


10 

.00712 

.00722 

.9843 

.9714 

1.5 

5 

.01249 

.01178 

• 9526 

1.0106 


10 

-00597 

.00586 

■ 9851 

1.0027 

2.0 

5 

.00951 

.00849 

• 9576 

1.0730 


10 

.00454 

.00432 

.9845 

1.0340 

2-5 

5 

.00653 

.00564 

.9549 

1.1052 


10 

.00309 

.00289 

.9832 

1 .0514 

3.0 

5 

.00403 

.00346 

.9374 

1.0906 


10 

.00188 

.00175 

.9747 

1.0441 

3*5 

5 

.00224 

.00189 

.8988 

1 .0615 


10 

.00102 

.00091 

.9543 

1.0717 


< <1|CNJ 
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Summary 


Considering the problem of estimating the optimum probability 
of mi sclass if ica t ion under the maximum likelihood discriminant 
rule for univariate normal populations with common variance, we 
give its minimum variance unbiased estimate (MVUE) and derive 
the variance of the. estimate for both cases of variance known 
and unknown. Next, we obtain the mean square error of the maxi- 
mum likelihood estimate (MLE) and compare the two estimates by 
evaluating their relative efficiencies with respect to the Rao- 
Cramer lower bound for variances of any unbiased estimates. Also, 
bias of the MLE is investigated. 



CHAPTER I 


INTRODUCTION 

The Bayesian Solution of the Discrimination Problem 

Consider m populations and suppose that each 

individual in the union of these populations possesses p 

common observable characteristics C 1f ...,C . Let the 

X P 

observed values of an individual be denoted by 
X = (x]_, . . . ,Xp) T , where x^ denotes the observed values of 
Cj. The classical discriminant analysis problem consists 
of formulating a technique to divide the space of observa- 
tions into m mutually exclusive and exhaustive regions 
Rl , . . . , R jjj (hence classifying the individual in population 
n j if X falls in region R j ) in a manner such that the cost 
of misclassifying the individual is minimized. 

There have been various techniques proposed for solv- 
ing the problem, of which the Bayesian solution is optimal, 
in the sense that it minimizes the expected cost of 
misclassif ication. Anderson [l] states the Bayesian 
solution as follows; 

"If qj_ is the a priori probability of drawing 
an observation from population with density 
p^(X) (i = l,...,m) and if the cost of misclassi- 
fying an observation from n as from Ilj is 
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C(j|i), then the regions of classification, 

^l'*** / **m' that minimize the expected cost are 
defined by assigning X to Rj, if: 
m m 

l qiPifXjCtkl i) < l qiPi(X)C(j |i) (1) 

i=l i=l 

for j = l,...,m,j ^ k." 

Estimation of Population Parameters 

In actual situations the values C{i|j) for 

all i, j = 1, . ..,m, and the probability densities p^(X), 

P 2 (X) . . . ,p m (X) may not be known. One solution is to 
assume the populations are equally likely, C(i[j) = 0 for 
i = j and C(i|j) is constant for i ^ j, and the populations 
are normal, that is, when X is from population n^, 

X ^ N(p i ,E ± ) (2) 


where is the mean vector and is the covariance matrix 

of the ith population. In addition, if the parameters yj_ 

and i = l,...,m, are unknown and a random sample of 

(i) (i) (i) 

size n^ (denoted x-^ , X 2 ,..., x n . ) can be obtained 

from the ith population, then and can be estimated 
by the estimators X^ and given by 


n- 


H = t J 


1 v (i) 

3=1 j 


(3) 
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, Pi(X) 

= Pi (X,y i# Zi) 

(5) 

by 

Pi(X) 


(6) 

or 

-E -I 




Pi (X) = ( 2 TT ) 2 I Xil 2 

exp[-|(x-Xi) T z " 1 (x-Xi) ] 

(7) 


An individual with. the observation vector X = X 0 will be 
classified as belonging to the jth population if 

max pi ( Xq ) - p j (X Q ), x = 1,2 , . . . ,m. (8) 



CHAPTER II 


DIMENSION REDUCTION AND VARIATE SELECTION 

Misclassif ication as a Function of Separation 

If p, the vector length, is large, the problem of 
classifying individuals becomes quite cumbersome and time 
consuming [7,9], For example in the analysis of remote 
sensing data [10], when p = 12 the amount of computational 
time is immense. Thus it is desirable to reduce p while 
maintaining near optimal results. One technique is to 

pick an acceptable number of characteristics, say s, si p, 
based on an examination of a measure of the separation 

between the populations, and proceed with Bayesian classi- 
f ication procedure using the s-variate marginal densities. 
Provided the separation is held constant or nearly so, it 
is feasible that fewer than p variates may be used for 
classification purposes without significant degradation of 
accuracy. 

In the case where m = 2 and and n 2 ate normal 
populations with unknown parameters y 2 and 

= £2 = the classification procedure (8) can be 
replaced by an equivalent discriminant function [l] 

v = x T r 1 (x 1 -x 2 ) -|<x 1 +x 2 ) t e “ 1 (x x -x 2 ) 


( 9 ) 
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where X^ is given by (3) and 

2 n- 


I = 


n - + * 2 I I 1 tx^’-Xi) (xji’-Xi) 1 . (10) 

n l +n 2 2 i=l j-1 1 J 


V is asymptotically distributed N(-ct/2,a) when X 'b N(y 2 ,z) 
and N(a/2,a) when X ^ N(y^,E) , where a is the Mahalanobis 
distance between the populations given by 


« = (yi-V2) T S" 1 <y 1 -u 2 ) ♦ (11) 

Thus the probability of misclassif ication is asymptotically 
dependent on a and not on p. Hence the procedure to select 
s < p in a manner so as to hold the separation, a, relatively 
constant, and then to classify on the basis of s variates, 
is reasonable. 


Development of Divergence (A Measure of Separation) 

For the two population case, ly f 6 E 2' fullback [5] 
uses the divergence between the populations as the measure 
of separation or difficulty of discriminating between the 
populations. Defining the logarithm of the likelihood ratio, 
log tPl ( X 0 )/P2 ' to b® t * ie information in X = X Q for 

discimination in favor of against n 2 , the mean infor- 
mation for discrimination in favor of against is 

J 12 = / Pl( x ) log - 1 dx . 

P 2 (X) 


( 12 ) 
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The divergence between the populations, J, is defined by 

J = X 12 + I 21 • (13) 

In the case where and n 2 ar s s-variate normal 
populations , 

0 

s -i , 

Pi (X) = (2ir) ^\l L \ 2 exp[-i(X-y i ) T zi 1 (X-u i )] (14) 

for i as 1,2. Then 

log = i log 1 Z - 2 J - - i tr (X-^) (X-y x ) 

, p 2 (X) 2 IjJ 2 1 

(15) 

+ | tr Z2 1 (X"P2) ( X- y 2 ) T / 



(16) 


(17) 
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or 


J 2 tr ^ Z 1 Z 2^ ^ Z 2 “ Z 1 ^ 


1-1-1 T 

+ 2 ^1 + ^2 ^ (Ui“ ^2^ ( V ]_ — ^ 2 ^ 


(18) 


Assuming equal covariance matrices, ~ I 2 = 2, 

1-1 T 

J - - tr [2E ^i - ^ ( u 2 _"~ v« 2 5 ] 


(19) 


or 


T -I 

J = (v^-U 2 ) ^ (yi - l J 2^ 


( 20 ) 


That is, if Ejl = Z 2 ” 

J = a (21) 

where a is the Mahalanobis distance between the populations. 

It may be noted in (18) that the first term of the 
expression for divergence is due to the difference of the 
covariance matrices, and the second term is due to the dif- 
ference of the means. In actual situations Z^ / z 2 , however, 
one may choose to ignore the difference and compute a instead 
of J, Then a value to substitute for Z in (11) or (20) is 


Z 


lr^2 

2 


( 22 ) 


This assumption is made within this report, and the conclu- 
sions of Chapter IV indicate it is reasonable. 



Sample Size Considerations 

If the parameters of and JI 2 are unknown but esti- 

mated on the basis of training samples of sizes n-^ and 
respectively/ then the choice of s may be heavily dependent 
on the values n^ and n 2 . Consider 1) if E]_ = Z 2 = E f 
(n! + n 2 )p measurements are used to estimate the 
2p + (p 2 + p) /2 distinct elements of X^/}^ and E ; and 2) if 
^i ^ £ 2 ^ (ni + n 2 >P measurements are used to estimate the 
3p + p^ distinct elements of Xi/X 2 ,Ej_ and E 2 . In either 
case, the number of elements to be estimated increases as 
a function of p^. This suggests that for small sample sizes 
the probability of classification may actually be improved 
by considering fewer variates. The results contained herein 
support this hypothesis. 
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CHAPTER III 

SIMULATED POPULATION CLASSIFICATION 


Classification Criteria 

Anderson [l] discusses the classification when m = 2 
and and n 2 are normal populations with equal covariance 
matrices. The classification procedure (8) can be replaced 
by an equivalent discriminant function 


V = X T r 1 ( Xl -x 2 ) -f (X 1+ x 2 ) V 1 (Xi-Xa) {23) 

A 

where X^ and £ are given in (3) and (10) , respectively. If 
n l = n 2 ~ n then the distribution of V if X is from is 
the same as the distribution of -V if X is from n 2 * Thus 
if V ! 0 is the region of classification as n^, then the 
probability of misclassifying X when it is from is equal 
to the probability of misclassifying X when it is from n 2< 
Since V is asymptotically distributed N {-a/2, a) when X is 
from H 2 , P ( 1 [ 2 ) , the probability of classifying an observa- 
tion from n 2 in n^, is approximately 


Similarly 



1 

/ 2 ^ 


- (v+ct/2) ^/2ci 
e dv 


P(2|l) 



- (v-a/2 ) 2 /2a 
e dv 


(24) 


( 25 ) 


and P (2 1 1) = P (1 1 2 ) . 
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In the event 7 * Z 2 / P]_(X) and p 2 (X) are estimated 
on the basis of training samples of sizes n^ and n 2 , 
respectively. X is classified in n 2 if 

p 2 (X) > Pi(X) (26) 

and in Jl-^ otherwise. Notice it is not necessary to compute 

Pl (X) and p 2 (X) since (26) is equivalent to 

1 1 

|Z 2 | i 2 exp[-z 1 /2] < i^j 2 exp[-z 2 /2] (27) 

where 

_ T A -1 " _ 

Z ± = (x-x ± ) Z ± (X-X ± ) , (28) 

i = 1,2. Taking the natural logarithm of (27) and multiplying 
through by 2 gives the classification rule: Classify X in Jl 2 

provided 

log (det £jl)-z 2 > log (det Z 2 )-zi (29) 

and in III otherwise. This rule is equivalent to (8) and (26). 

Monte Carlo Simulation Procedure 

If the populations n ^ and II 2 are assumed to have equal 
covariances matrices, the probability P(2jl) can be estimated 
from (25) provided the sample sizes ni and n2 are large. 

For small (and moderately large) values of n^ ** n 2 = n, the 
probability P ( 2 f 1) can be estimated by Monte Carlo methods. 

The process involves taking a sample of size n from each of 



.the two normal populations. These samples are used to 

_ _ A 

compute X^,X 2 and 2 according to (3) and (10) . A sample 
population of 50 is generated from ^ N(yi,E) and 50 
values of V are calculated. The probability of classifying 
an individual from the population as belonging to H 2 is 
then estimated by 

P (2 1 1) = k/50 .(30) 

where k is the number of values of V < 0. The sequence 
[training samples - estimation - population - classification 
P(2j 1) ] is repeated 50 times, and the mean of the 50 values 
of P(2jl) is used as the estimate of P(2jl). The results 
of Test Case 1 were obtained in this manner. 

When the covariance matrices of Jt-j. a nd n 2 are not 
assumed to be equal, the procedure ' for estimating P(2.|l) 
is basically the same as the preceding discussion. Dif- 

A A 

ferences arise in the estimation of and E 2 and in the 

A 

classification procedure since V is no longer valid. E^ 
and E2 are computed according to (4) . The sample population 
of 50 is generated from Hi ■v N(yi,Ei) and the 50 members are 
classified as belonging to n or H 2 according to (29). Then 

P (2 1 1) = k/50 - (31) 

where k is the number of times a member of the population 

is classified as belonging to n 2 - Again the sequence 

- 

terminating in the computation of P(2jl) is repeated 50 



times and the mean of the 50 values of P(2|l) is used as the 
final estimate of P ( 2 | X) - 

When ? ^ 2 1 P(2|l) ^ P(l[2). However, the estimation 
of P(l[2) is accomplished by changing the order of the input 
of the parameters of the populations. The results of Test 
Case 2 were obtained in this manner. 



CHAPTER IV 


RESULTS AND CONCLUSIONS 
Test Case 1: Feasibility 

To demonstrate the feasibility of selecting a subset 
of the variates while holding the separation measure constant 
consider the populations and n 2 and the simulation results 
given in the following table: 

TABLE 1. TEST CASE 1: FEASIBILITY 


Number of 

ui=o 

Max 

a 

Variates 

^2 

a 

P(2|l) 

1 

(1) 

1 

.318 

2 

(1,1) 

2 

.256 

3 

(1,1,1) 

3 

.216 

4 

(1,1, 1,1) 

4 

.171 

5 

(1,1, 1,0,1) 

4 

.192 

6 

(1,1, 1,0, 1,0) 

4 

.195 

n x -v n ( 0 , 1 ) 

y 2 

= d,l 

,1,0, 1,0) 

n x ^ N ( y 2 , 1) 

n l 

= n 2 = 

15 


Table 1 illustrates a hypothetical situation where the 
maximum separation, a, for the populations considered as 
4-variate, 5-variate, and 6-variate populations is the same. 
Thus, if the training samples are large, an individual 
could be classified on the basis of four characteristics, 
namely C^,C 2 ,C 2 and.C^, with no degradation of accuracy. 
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For small sample sizes, in particular nj_ = n 2 = 15, the 
simulation indicates that sample size considerations are 
significant, and better results are obtained using only 
4 variates. 

Test Case 2; Remote Sensing Data 

To test the classification procedure using selected 
variates under meaningful conditions, actual data from a 
12-channel sensor typical of those used in remote sensing 
of agricultural crops was obtained from NASA and used as 
population parameters for simulation purposes. H 1 was a 
composite of soybean fields and is represented by and 
El as given in Table 2. n 2 was a composite of corn fields 
and is. represented by U 2 an ^ s 2 as given in Table 3. The 
divergence J for all combinations of 12 channels taken s 
at a time (s — 1,2,..., 11) was computed by NASA. As many 
as 50 of the largest values of J were printed for each 
value of s and the values were included in the data package. 

For purposes of comparing separation measures, a was 
computed [assuming z / 2 1 for s = 1/3,6 and 12. 

The maximum and minimum a and the maximum J for the various 
values of s are plotted in Figure 1. The two distance 
measures are not readily comparable; however, the data 
does demonstrate the existence of combinations of fewer 
than 12 channels which yield almost the same separation as 



Mean 


169. 56 

174. 76 

193. 24 

192. 64 

169. 11 

166. 80 

190. 56 

171.49 

185.49 

173. 79 

160.94 

181. 18 

Covariance 

Matrix 











19. 08 












14.47 

14. 28 











9.01 

7. 52 

5.91 










9.40 

8.49 

5. 12 

6. 50 









16. 77 

14. 59 

9. 00 

9.86 

19.23 




. 




13. 73 

11. 98 

7, 24 

8. 01 

14. 22 

12. 84 







7. 97 

7. 01 

4. 68 

4. 90 

8.44 

7. 15 

5. 31 






13.79 

12. 62 

7.72 

8. 63 

15. 12 

12. 30 

7.42 

15. 17 





10.43 

10. 02 

6. 00 

7. 00 

12. 08 

9.82 

6. 14 

11.47 

10. 88 




9.46 

8.69 

5. 68 

6.28 

10.99 

9.26 

5. 97 

10. 32 

8'. 57 

10. 10 



-3. 02 

-3.48 

-1. 33 

-2 . 19 

-2. 81 

-1. 10 

-. 20 

-4. 14 

-4. 02 

-. 84 

23. 59 


-4.73 

-4. 11 

-2. 73 

-2. 75 

-4. 24 

-2.69 

-1. 67 

-4. 11 

-3. 23 

-1.43 

6. 55 

9.70 


Table 2: MEAN AND COVARIANCE FOR SOYBEANS 
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Table 3: MEAN AND COVARIANCE FOR CORN 
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the full complement of channels. In particular, regardless 
of the choice of separation measure, very little increase 
in maximum separation is obtained by considering more than 
5 channels. 

To further study the equivalence of J and a as the 
criteria for the selection of variates. Table 4 was con- 
structed. Table 4 contains some of the values obtained for 
J and a for various subsets of variates. For a single 
variate, the largest three values of J arise from consider- 

J 

ing the same three variates or channels that yielded the 
largest values of a, and in the same order. 


TABLE 4. SEPARATION AS A SELECTOR OF SUBSETS 


Subset of Channels 

Values of a 


Values of J 

(10) 

7.8* 


16** 

( 9) 

7.0 


15 

( 8) 

4.8 


10 

( 6,10,12) 

12.0 


32** 

( 6, 9,10) 

12.4* 


31 

( 1,10,12) 

12.3 


30 

( 4, 6, 9,10,11,12) 

13.5 


38** 

( 4, 6, 8, 9,10,12) 

13.6 


38 

( 4, 6, 7, 9,10,12) 

13.5 


38 

( 1, 4, 6, 9,10,12) 

13.6 


38 

{ 1, 4, 8, 9,10,12) 

13.7* 



* largest a for fixed number 

of 

channels 

** largest J for fixed number 

of 

channels 


For the three variate case, the best three subsets by J 
measure were the best three subsets by a measure, though 
in different order. In the case of 6 variates, from the 



924 possible subsets of channels, the best three by J 
measure were not among the best three by a measure; however, 
the best three by J did yield values of a significantly 
close to the maximum a. 

The conclusion from Figure 1 and Table 4 is that either 
of the separation measures between populations, divergence 
and Mahalanobis distance, is a suitable selection criteria 
for the desired subset of variates. 

Figures 2, .3, and 4 were constructed from simulation 
data based on training sample sizes of 15. They support 
the theory that P(2|l) is a decreasing function of a. In 
most instances P(2[l) decreases as the distance measure 
increases. The relation is consistent with what could be 
expected from the asymptotic theory. Figure 2 is not too 
coherent because there are exactly 12 data points to con- 
sider. It does, however, indicate the trend. Figure 3 is 
very consistent in support of the relation. Figure 4 shows 
considerable scattering. If one looks ahead to Figure 6, 
the reason for the scatter in Figure 4 is evident. The 
sample size was too small. A training sample size of 30 
would probably have produced a more consistent figure. 

Figures 5 and 6 are the significant figures of this 
report. They represent the same data presented from two 
different viewpoints. Figure 5 shows that for small fixed 
training sample sizes, there is a subset of fewer than 12 



variates which yield the minimum probability of misclassifi- 
cation. Figure 6 holds p constant and varies the training 
sample size. This figure shows that, for the populations 
considered, if one is constrained to n - 11, one might just 
as well look at a single variate. For 11 < n < 30, 3- 
channel data is adequate, and n must be somewhere in the 
neighborhood of 100 before 12-variate data can be justified 
over 6-variate data. 

The conclusion of this report is that for small sample 
sizes, a subset of the variates can be chosen so as to yield 
better classification results than the full complement of 
variates, and the selection of variates can be accomplished 
by considering the separation between the populations. 
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ACREAGE ESTIMATES FOR CROPS USING REMOTE SENSING TECHNIQUES: 
CLASSIFICATION ERROR MATRIX KNOWN 

1 . Introduction 

Remote sensing technology has shown a great potential for data collection 
for the earth resources in any given geographic region. As a result of this, 
it may be feasible to assess various earth resources and, thereby, answers 
to some of the important yet previously unsolvable problems may be available. 
At present we address ourselves to the problem of estimating crop sizes, i .e. , 
amount of acreage under crops, in a large area using remote sensing technique. 
Even though it’ may not be difficult to obtain full data over an area by using 
remote sensors, crop acreage estimation on the basis of complete enumeration 
of data points in the area may not be feasible both from technical and 
economical viewpoints. As such it would be desirable to derive estimates 
based upon sampled data acquired by a suitable sampling process. 

In this report we discuss a sampling scheme providing crop acreage 
estimates in a given geographic region, and investigate the precision of 
these estimates and the effect of misclassif ication on the estimates. Since 
observations are obtained on individual pixels, we consider the frame made 
of the collection of all pixels in the region of our interest. As to the 
actual layout for various crops, two cases arise. One is that no informa- 
tion is available and the other is that some a priori information on the 
physical situation of crops exists. For example, if the area of interest 
is small like a county or certain areas covering V/estern Oklahoma, it is 
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possible to have some prior information for the location of different crops. 

In our discussion we consider each of these cases and point out the difference 
that exists in precisions of two estimates. 


2 . Formulation of the Problem 

Let C 1? C 0 ,...,C be m different crops and be their corre- 

12 m 12 m 

sponding acreage areas in a region. Assume the whole region consists of N 

total number of pixels. Since a pixel on the basis of observation taken by 

a remote sensor is subject to uncertainity in its correct ident i f icat ion, 

let P ( i | j ) denote the probability of mi sc 1 as s i fy i ng a pixel from crop 

into crop C., and P ( i f i ) denote the probability of correctly classifying a 

pixel into crop C.. Accordingly, the expected acreages associated with C. 

on the basis of remote sensing data is 

m 

E, = Aj p(i f j) , (2.1) 

i 1,2,*.*, m • 


Given the total number of pixels and a pixel size, it is sufficient to 
consider proportions of pixels associated with different crops in the region. 
Let pj ,p 2 ? - • * be the proportions of pixels for C ^ , C^, . . . , C^, respectively. 
Then the expected proportions of pixels likely to be identified in C. , 
i = 1 >2 are 

m 


e . 

i 



Pj P(i !j) • 


( 2 . 2 ) 


j = l 


e. coincides with p. , 1=1 ,2, . . . ,m, when every pixel is correctly classified. 
Since it is almost Impossible to expect so in remote sensing, in our discus- 
sion we first consider estimation of e., i-1 , 2 , . . . ,m, and then seek estimates 
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for p , i=l,2,...,m, assuming that P ( i | j ) > 0 at least for one j different 
from i. This will be done considering two cases: (a) P(j|i)'s are known 

and (b) P(i[j)'$ are unknown. 

3 . The Sampling Procedure 

As we mentioned earlier in Section 1, we consider the following two 
cases : 

(i) Crop boundaries (i .e. the physical layout for crops) are unknown. 

(ii) Crop boundaries are enhanced and so known. 

(i) In this case, a sampling procedure which appears to be useful is a two- 
stage sampling scheme. At the first stage a specified number of flightlines, 
each of the same size, are randomly selected, and at the second stage an 
equal number of units (pixels) are randomly selected from each of the selected 
flightlines. However, in order for this scheme to be useful, it may require 
a fairly large number of flightlines to be selected. Otherwise, a simple 
random sampling with single-stage, though more difficult to execute, may 
produce estimates with smaller variance than the two-stage random sampl i ng 
scheme . 

(ii) When the physical layout for the crops is known to a certain degree, 
consider the region made of subregions, each consisting of flightlines 
covering area as homogeneous as possible. This could allow more than one 
crop in a subregion, and such subdivision should be restricted to a minimum 
possible number of subregions. For an illustration, see Figure 1 given at 


the end. 
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For a sampling procedure, we suggest a three-stage sampling scheme 
carried out in each subregion. At the first stage flightlines are randomly 
selected proportional to the size of a subregion; at the second stage blocks 
(or segments) are randomly selected within a flightline proportional to the 
size of a crop in the flightline; and at the third stage units within blocks 
are randomly selected in equal numbers. The number of blocks selected in 
each of the sampled flight! ines is the same. Here by block we mean a 
sampling unit of specific size within a flight line. It may be pointed out 
that careful consideration should be given in specifying the size of a 
block. Neither should it be too small for any practical use, nor should 
it be too large to make any distinction for the crops when sampled. This 
way at the first two stages the sampling is proportional to the size of the 
underlying strata (i.e., subregions or crops, whichever may be the case) and 
at the last stage it is a simple random sampling. Though one may expect 
accumulation in sampling error due to three stages of sampling, it should 
lead to a smaller variance of the estimate for e. ( i=l ,2, . . . ,m) compared to 
a single-stage or double-stage sampling scheme. 

Expected Proportions Estimates 

(i) Let N denote the total number of flightlines for the region and let n 
flightlines out of N be randomly selected. Suppose each flightline consists 
of R units from which r units are randomly sampled for each of the sampled 
flightlines. In the sample, let m. denote the number of units from tth 
selected flightline classified into C.. Now denoting 



t = 1 ,2,. .. ,n, 



an unbiased estimate of e. is given by 

n 

m. t , i=l ,2,. . . ,m . 
t=l 



(*».!) 


Since the total size of units, NR, is expected to be large in relation to the 

/\ A A ^ 

sample size, nr, the random vector e = (e^ . . . ,e ) has a multinomial dis- 
tribution, usually a satisfactory approximation In such a situation. So it 
follows that the variance of e., 


N N 

V (e. )=( 1- -) n ( N _ 1 ) ^ + R^TTF N(R-l) ^^ e it^~ e it^ 

t=l t=l 


and an unbiased estimate of V(e.) is given by 


^ " N^ n(n-l) ^ e i t~ e i ^ + ^~ R^ Nn(r-l) ^ e ft 


n, , 1 


(l-e. t ) (4.3) 


t=l 


t=l 


where e- t denotes the expected proportion of C. in tth flight-line and 
N 

e. = e. t /N, the same as defined in (2.2), (Cochran, 1963 ). 

t^l 

(ii) Again, let N denote the total number of flightlines, each consisting of 
R units. Suppose the region is divided into k subregions, Rj .R^, • . •. , R|^ , and 
are the corresponding number of flightlines for R^ , R^, . . . ,R^ 

such that 

j = 1 



Considering a sample of n flightlines out of N, 

randomly selected from N. in R. such that n. = 

J J J 


n 


let n . 


J 

(N./N) 


f 1 i ght 1 i nes be 
j=l ,2, . . . ,k, and 
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n = 


k 

E 

j=i 


n . . 
J 


For subregion R., let B.. be the number of blocks covering i th crop 
J J it 

in the tth sampled flightlines; from this let b . ^ blocks be randomly 


samples such that 


B - 


E 

i = l 


B . . , 

J't 


b = 


2 b ju 

i-1 


and 

b • • * “ *-» B » * j 

j 1 1 B j 1 1 

t *“ 1 , 2 , . . . ,n j , 

i “ 1 j 2 j • • » j m » 

Considering block size L, suppose 1 units are randomly selected for each 
sampled block. Thus the sample size for subregion is n^bl and the complete 
sample consists of nbl units. 


Let iK j t denote the number of sample units being classified in C. for 


tth flight line selected from R y Then 


n . 


6 j i nbl U j i t 

J t =l 


(b.k) 


is an estimate of e j j » the expected proportion of units in C. for subregion 
R y and j=l , 2, . . . , k. Thus 


1 X A 

= I / 4 M J e j i * 1 = 1,2 

j = l 


(A. 5) 


Next, 


V(e.) = 


E 


j= 



100 


where 


- e. . (1-e. .) 

Vie..) = — ( I- S) ' J' . 

j i n N N . - 1 

J 


assuming the same variance in each flight] ine of FT. for a fixed C. . Thus 


k 2 


,(e i> "T3f 0 H 


N. 

8> 2-r e ji < ! V > 

rrr -j 


j=i 

i 1 , 2 , * . * , in * 


5 . Actual Proportions Estimates 


Denot i ng 


e - 


~ e l 


" p i 

e 2 

> p = 

p 2 

• 


* 

e 


p 

m 


_ m 


and 



P(1 | 1) 

p(M 2 ) • ■ • • 

P(1 j m) 

a - 

P(2|l) 

P (2 | 2) .... 

P (2 ! m) 


P(mjl) 

P(m|2) .... 

P (m | m) 

be wr i tten i 

3 S 





e = Qp . 



Observe that Q. may be a singular matrix. In view of this the equations 
in (5-1) may or may not have a solution for p for specified e, Q. In 
case any solution exists, it is given by 


(4.6) 


(5.1) 
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p = + p Q (5.2) 

where is a generalized inverse of Q. matrix satisfying 

QQ 1 ^ = Q (5-3) 

and Pg satisfying Qp^ = 0. 

In the second case when no solution exists, we look for the vector p 
that minimizes the squared length of e-Qp for a given e, (J. Any such vector 
is given by 

p = Q S e + p o (5-4) 

S 

where Q is a right pseudo inverse of Q. matrix satisfying 

m S ) J = QQ S (5-5) 

and Pq satisfying Q,p^ = 0. 

Since more than one p^ may satisfy the condition Qp^ = 0, in either case, 

a unique solution for p may not exist. Accordingly, either one would have a 

complete set of solutions p given by (5*2) or a complete set of vectors p 

which are a "best-fit" when obtained from (5-4). The techniques for finding 
T S 

Q~ , Q and are well known (Boullion and Odell, 1972). Henceforth, by a 
solution we mean in either sense and will denote i t by 

P = a G e , ( 5 . 6 ) 

assuming = 0 without loss of generality. 


5-1- Q, known 

A 

(i) Taking the estimate e given by (4.1) and the known value of Q, it follows 
from the above discussion that an estimate of p is given by 


; - Q G e . 

G G 

Denoting the (i,j) the element of Q by q . j , we have 


(5.7) 
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P i = 


m 

E 

j=i 


G * 
q . . e . 
U J 


i=l ,2 , . . . ,m. 


(5-8) 


These are unbiased estimates of p., i=1 ,2, . . . ,m. For the variance. 


V(p 


i> - £ h?j> 2 + £ £ 


A A 


q^ . q^.. Cov(e.,e.,) 
U U J J 


j=l j'=1 
j^j ' 


where V(ej) is given in (4.2) and 


(5.9) 


c ov (e. , e . , ) . (1- J) ^, T ^2 (e jt -e.)(e. lt -ej,) 


J J 


t=l 


^ R^ Nr N ( R- 1 ) ^ j e 7^ e j' t > (j^j') 


Jt 


t=i 


Next, an unbiased estimate of V( e j) * s given in (4.3)* Similarly an unbiased 
estimate of Cov (e^ej,) can be shown as 


ru 1 


A A 


Cov(e.,e.,) = (1- ^ (e jt‘ e j ) (e j 't~ e j ■> 


J' J 


t=l 


" (1 ' ntit-T) 22 e jt e j't» (jVjl) • 

t=i 

Replacing the unknown quantities in (5-9) by their estimates, one thus 
obtains an unbiased estimate of V(p.), i =1 , 2 , . . . ,m. 


(ii) in this case using the estimate e given in (4.5) we obtain 

P - Q°e 


(5-10) 


or 
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P i = 


E 

j=l 


G " 
q. . e. 
IJ J 


i = l ,2, . . . ,m 


(5.11) 


Once agai n, 


V(p.) - 


m m 

(q^j ) 2 V (ej ) + ^ ^ y q^q^j ,Cov(e.,e. ,) ( 5 - 12 ) 

j = l W J'“l 


m 

E 


• j 


jvj 1 


where V(ej) is given in (14.6) and 


A A 


1 


Cov(e.,e.,) - - m (1 


\ ■> N. 

- H) \ __J_ 

fr / ; N.- 

i = l J 


e. . e. . , 

ij ij 


A A 


Now replacing V(e.) and Cov (e.,e.,) by their estimates, we can get an 
j j j 

A 

estimate of V(e.), i=l,2,...,m. 


5.2. Q. unknown 

In this case one also needs to estimate P ( i | j ) , i and j = l,2,...,m. An 
obvious way to do so is to ascertain the ground truth for a certain number of 
additional observations taken independently of those used for estimating e. 's 

A A 

and utilize these to estimate P ( i J j ) 1 s . In view of this e and Q. will be 


stochastical ly independent and an estimate of p is given by 
P - e 


(5-13) 


or 


m 

E A pi A 


i = 1 , 2 , . . « , m . 


(5-14) 


j =i 

A G aQ 

where is the (i,j)th element of Q. . 
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For a given Q. , the conditional variance of p. is given by (5-9) in 

case (i) and (5*12) in case (ii) with q..'s replaced by q..'s. For the 

■ J i J 

unconditional variance, the expression will involve the variances and co- 

X Q 

variances for the elements of Q . In general i t wi 1 1 be difficult to express 
the unconditional variance explicitly. 
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Figure 1 : A subdivision of a region with five crops 

for a three-stage sampling plan. 
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Estimation of Crop Acreage Through 
Sampling of Remotely Sensed Data 


R. S. Chhikara and P. L. Odell 
The University of Texas at Dallas 

1 . INTRODUCTION 

In this report we consider the problem of estimating crop acreages 
in an area using samples from remotely sensed data for the area. Ration- 
ale for using samples is to avoid enormous cost and time that might be 
involved otherwise if the full data is processed. In particular, such 
would be the case for a ERTS scene which generally has over half a mil- 
lion data points and covers approximately an area of 100 X 100 square 
miles . 

While considering crop acreage estimation, it is desirable to as- 
sume that the underlying region is an agricultural area and that every 
data point is identifiable with respect to certain known types of crops. 
In a larger context of an arbitrary area, it is sufficient for the con- 
dition of identification to hold with respect to the underlying known 
earth resources. 

To formulate the problem, let n lt n 2 , ..., n m denote the m crops in 
the area and p x , p 2 , ..., p^ be the proportions of their acreages. Next, 
let P ( i / j ) be the probability of classifying a resolution element (pixel) 
from jin into n. using a classification algorithm. Then associated with 
such classification algorithm processing of full remotely sensed data 
would amount to expecting proportion of acreage given by. 
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m 

e, = e p, P(i/j), i, j, 2, m (1) 

j=l J 

Equivalently, 

e = Pp (2) 

where 



r e ii 


r p ii 



e = 

e 2 

m 

m 

, P = 

p 2 

« 

and P = 

~P(1/1 ) P(l/2) . . . P(l/m)' 


* 


♦ 


P ( 2/1 ) P ( 2/2 ) . . . P(2/m) 


» 

e ~ 
m 


• 

r m 


P(m/1 ) P(m/2} . . . P(m/m) 


If the vector e and matrix P are known, one gets p by. solving (2) sub- 
ject to ip^ = 1. Otherwise one needs to consider estimation of e or P 
or both of these as the case may be in order to ascertain about p. 

In general, e will be unknown. An estimate of e is given by the 
vector of observed proportions of resolution elements processed and clas- 
sified into n. , i = 1 , 2* . . . , m using sample data obtained under a 
sampling plan. Regarding P, two cases arise: 

(i) P is known 

(ii) P is unknown 

Again, P will generally be unknown and for an estimate of P one would 
require some suitably selected amount of the ground truth, probably in- 
dependent of the previous sample data used in estimating e. Moreover, 
care should be exercised in handling of the sampled ground truth. Most 
often, it would be desirable to use the sample for both training the 
classifier and obtaining estimate of P. 
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Clearly how much complex the estimation problem is depends upon how 
much one knows about P. In an entirely unknown situation of classifica- 
tion with untrained classifier, estimation of acreage proportions can be 
very misleading. As such we would assume for the classifier to be proper- 
ly trained and thus are able to achieve reliable estimate of P. 

2. PROPORTION ESTIMATES 

A 

Let e be an estimate of e and P of P. Then an estimate of p is given 
by 

(i) p = P e when P is known (3) 

and 

.... s -_i ~ m 

(li) P = P e when P is unknown v 

In case (i) p is an unbiased estimate of p whenever e is an unbiased es- 
timate of e. However, if estimate for both e and P are involved in esti- 
mating p as in case (ii), it may hardly be possible to obtain an unbiased 
estimate of p in p. Henceforth, we will assume e, and so also p, to be 
unbiased estimate and it is the vector of observed proportions of resolu- . 
tion elements processed and classified into different n.. 's (This will be 
the case due to the sampling plans being considered in this report.) 

Also, we would restrict our investigation to case (ii)"and for case (i), 
refer to Chhikara and Odell [ i ]. Though final results in [1 ] pertain 

to specific sampling plans, the discussion there is of general nature 
and results can be adopted in any sampling situation. 



3. MEAN SQUARE ERRORS OF ESTIMATES 


First we calculate the bias of p given by 
Bias (p) = E [p-p] 

= E CP _1 e - P _1 eJ 
= E Cp" 1 (e-e) + (P -7 - p-^e] 

= E EP- 1 - P"^e (5) 

because the first term is zero due E(e-e) = 0 for a given P~ 7 . Clearly, 
the bias depends upon how much bias there is in P -7 , and 

Bias {p} = (Bias (P~ 7 })e. (6) 

A 

In order to find the mean square error of any component of p, we 
first consider the evaluation of matrix, 

E C(p-p)(p-p) T ] = e C(P _1 e - P~ 7 e)(P _7 e - P" 7 e)*3 

= E E(P" 1 )(e-e)(e-e) T (P“ 1 ) T + (P‘ 7 -;p' 7 ) ee T (P -7 - P~ 7 ) T ] 

= E [(P* 1 ) M(P" 1 ) T 3 + E C(P- 7 - P- 1 ) ee T (P" 1 - P" 1 )^ , ( 7 ) 

where M denotes the covariance matrix of e. 

Denoting the (i,j) th element. of P -7 by P 1J ' and that of P" 7 by 
P it follows from (7) that the mean square error of fi. is given by 


E 


m 

£ 

k=l 




Cov(ej,e k ) 


+ r e k (P -P u ) s (P ,J -P 1J ) e. 
k-1 j=l 11 

A 

where E stands for expectation with respect to P., Denoting it by MSE (p->, 
we have 



Ill 


MSE (p ± ) = E 


■p m 

I 

Lj=i 


. *. m 

♦T-; 

j=i 


2 E [P 1;5 


(P ^) 2 Var (e )+Y V P lj P lk Cov (e.,e ) 

3 j k 

j=l k=l 

m ; rn 

P 1 ^ ] 2 + ^ ^ e b^E [ (P 1 j -P 1 ^ ) (P lk -P lk ) ] 

3=1 k=l 
j^k 


(8) 


i ^ 2 ^ • ♦ « 3 m» 

/V 

The procedure for obtaining P would involve sampling of ground 

truth independent of samples taken for estimating e. In the following 

section we give expressions for v(§.) and Cov (e., e.) considering dif- 

^ 1 3 

ferent sampling plans. To evaluate expectation in (8), one needs to 

a 

find the distribution of P. This will, of course, depend upon how P 
is obtained. In general, it will be difficult to obtain any exact dis- 
tribution of P. However, if the sampling of ground truth involves sep- 

a 

arate independent samples from each crop and P is obtained as the ma- 
trix of observed proportions among randomly selected pixels classified 

A 

into different crops using a classifier, each column vector of P has a 
multinomial distribution and is stochaotically independent of the others 

A ^ m 

in P. Since expectation in (8) is for elements of P , it may not be 

A 

easy to derive the MSE' {p.} in a closed form, especially if the number 
of classes is large. As such we now consider the two-class case and 

A A 

show the procedure for obtaining Bias (pj and MSE (pj, 1=1, 2. 


Two-class Problem 

Often interest lies in ascertaining acreage of a specified crop in 
an area. In view of this one may consider Jii to be the crop of main in- 
terest and n 2 to be its complementary part. Without loss of generality, 
let us assume density functions for iij and n 2 to be symmetric about certain 



112 


location points. ' Then P(l/2) is same as P(2/l). Denoting this common 
val ue by one gets 


[ 1-<i> $ “1 

$ i-<j>J 


It now follows from (2) that 

e l = P 1 + ( p 2 " p l) * 
and 

e 2 = P 2 + (P-, - P 2 ) * , (e 1 + e 2 = 1) 

Assuming $ f 1/2, we have 

_ e, $ 
p, „ i 


1 - 2 


and 


„ e 9 - o 

Po - 2 


1 - 2 o 

Considering e^ , e 2 and £ (i f 1/2) estimates for e-j , e 2 and respec- 
tively, obtained according to the procedure outlined in the previous 
paragraphs, one has 

A A A 

£ e, - $ 

p, = j — 


1 - 2 


A A 


- e 0 - $ 
P 2 = _2 


1 - 2 


Clearly, p-j + p 2 = 1 because of e-j + e 2 = 1 . Next, it follows that for 
k = 1, 2, 

A 

Bias (p k > = (e k - 1/2) (E[T] - e) 
and 

A 

MSE {p. } = Var(e. ) E[T 2 ] + (e. - 1/2) 2 E[T - 0] 2 


where 


T = (1 - 2 i )" 1 


and e = (1 - 2 $) 


-1 
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To obtain E[T] and E [T ] , one way is to find the distribution of 
$. Let r-j out of N-j pixels sampled from Hj be misclassified into TI 2 
and r 2 out of N 2 pixels sampled from n 2 be misclassified into n 1 . Then 

. r l + r 2 

provides an estimate of <i>. The condition of & f 1/2 is easily met if 
we restrict N-j + N 2 to an odd number, a mild restriction which should 
not undermine the generality of present discussion. Denoting r = r^ + r 2 
and N = N-j + N 2 , the random variable r has a Binomial distribution with 
proportion $ and sample size N. Accordingly 


E[T} 55 s N /n\ a r (l _ $) N " r 
r=0 N - 2r \rj 


eit 2 ] = " n 2 2 (") * r 0 - ») N - r 

r=0 (N - 2rr v/ 

However, it is difficult to give any closed-form expression for E[T] and 
E[T 2 ]. Due to -1 < (1 - ~) <1 , 


E[T] = I (2/N) S u 
s=0 s 

and 

. co 

E[T 2 ] = l (s + l)(2/N) S y 
s=0 S 



114 


where y s is the sth moment of Binomial distribution about the origin and 
can be easily obtained by evaluating the sth derivative of the moment gen- 
erating function at the origin, i.e. 

s 


.V 

TW 


It can be shown that 


(1 - $ + $ e V 


-1 


t=0 . 


E[T] - ( 1 - 2$)"' + 0(1/N) 
E[T 2 ] = (1 - 2 $)“ 2 + 0(1/N) . 
Then asymptotically, i.e. as N becomes large, 
Bias { p k ) = 0 


and 


MSE (p k ) = Var (p k ) = (1 - 2ft)"* Var(e k ) 


4. SAMPLING PLAN AND COVARIANCE MATRIX OF e 


In a remote sensing situation involving collection of data over a 
large region, we suggest a stratified random sampling scheme with three 
stages. First of all, stratification will be most effective if strata 
are formed on the basis of at least 

(i) predominance of various crops, 

(ii) latitude, 

(iii) longitude. 

The first factor would lead to homogeneity in various strata whereas 
the other two factors are important from the point of assessing P appro- 
priately . 

Let R denote the region for which crop acreages are to be estimated. 
With the consideration of (i) - (iii) , suppose R is stratified into strata 
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R st , s=l,2,...,a and t=l,2,...,b, with weights w $t , the proportion of 
acreage (number of pixels in a stratum divided by the total number of 
pixels in the region), i.e. 

R = U R. 


s ,t 


'st 


with 


1 


■E 

s,t 


w 


St • 


Due to (i), one may expect elimination of many scenes or even many 
strata if interest lies in only a few specific types of crops. Nevertheless, 
for sampling purposes, select ERTS scenes at the first stage, strips within 
scenes at the second stage and scanlines within strips at the third stage. 

Of course, one can go one more stage in selecting pixels within scanline. 
However, this would not be convenient as far as processing is concerned. 

As such, this stage is not considered for the sampling. 

For stratum R we denote the following: 

e st ijhk = ex P ec_ ted proportions of pixels in it. for kth scanline 
in hth strip of jth scene, 

e st i jh = ex P ecte d proportions of pixels in ir.j for hth strip in 
jth scene, 

e st ij = ex P ec ^ ec ^ proportion of pixels in tt^ for jth scene, 

e st i " ex P ec t e d proportion of pixels in . 

Then 

q u 

i = 1 ,2,.. . ,m. 


a b 

e i = EE w st e st 1 • 
s=l t=l 


Next, let G st , H $t , R $t and n denote the number of scenes, number of 
strips per scene, number of scanlines per strip and number of pixels per 
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scanline, respectively, for stratum R Suppose g s ^, h^ and r $ ^ are 
the corresponding number of scenes, number of strips in a scene, number 
of scanlines in a strip that are selected randomly at three stages. Let 
n st ijhk * 3e num ^ er P 1xe ^ s classified into for kth selected 
scanline in hth selected strip of jth selected scene. Then considering 
the observed proportions for estimates, one has 


'st i'jhk 


st ijhk 


e st ijh . J 


st 

; £ 


st ijhk 


h st r st 


-st ij ■ nr st h st 



st ijhk 


9 st ^st r st 


e st i 


nr st h st g st 


j=l h=l k=l 


a b 


w st e st i’ 1 ,m ' 


s-1 t=l 


For the covariance matrix. 


^ * i ) = 0 ' s $t > g st (G st -i 


^ e st ij " e st i ^ 


G st H st 


. n- -St, 1 

H st 9 st h st G st^ H st _1 


^ e st ijh e st i j ^ 


j=l h=l 



117 


r G st H st R st 2 

+ ^ S s t h st r st G st H st (R sf 1) SZ)S (e st ijhk‘ e st ijh> 


J st i 

Cov(e st i’ e st = G^> g st (G st -l) ' e st ij‘ c st i" c st 1j"'st 

1 1 


1 


it 


i ^ e st ij~ e st i’^ 


G st H st 


+ (1- rfr 


H st g st h st G st v,, st 


+ (H , -1 ) EE^st ijh" e st ij^ e st ijh” e st i ‘ j ^ 


j=l h=l 


G st H st R st 


+ ^~ R st^ 9 S t h s t r st G st H st ( R st' 1} E iE^E ^ e st1jhk e sti jh^ e sti jhk _e sti 1 jh^ 


Thus the covariance matrix for e can be obtained because 

a b 

VaKe,) = E Z “st Var( ®st i> 

S=1 t=l 

a b 

Cov (ej.e,.) - E E "st Gov (®st i- a st i'> ■ 

S-l t=l 

For an estimate of the covariance matrix, 
a b 

Var ( e i> = E Z « l t Var(^ t ,) 

s-1 t=l 

^ si b ^ 


A A 


Cov ( e 1* e i ')“ Z Z «st Cov(e st i- e st i'> 

S=1 t=l 


( 9 ) 


00 ) 


where 



where 


Var < e st i) - 0- ST> 


st 




g st 

E (e 

j=l 


st ij" e st 


i> 


+ Ci- 


st 


1 


g st h st 


H st' G st 9 st h st (h st‘ 1) 2 ^ (Sst ijhest ,j) 

j=l h=l 


+ ( 1 - 

St 


^st^st^st^st' 


g st ^st r st 2 

E E E ( e s t i jhk" e st ijh^ * 

j=l h=l k=l 


Similarly one can write Cov (e^ . e^ .. , ) by replacing the sum of 
squares terms in the variance by the corresponding sum of product terms. 


5. OPTIMUM SAMPLE SIZE 


Taking the cost factor into consideration one may want to know the 
sample size that either minimizes the cost for a specified mean square 
error or minimize the mean square error for a. given cost. In remote 

sensing, the sampling cost would mainly involve a large initial cost plus 

\ 

the processing cost. Although it would depend upon a situation, the cost 
function rnay be considered as 

C st = C l 9 st + C 2 g st h st * C 3 a st h st r st 
for when sampling in stratum R st> and 

C = C l Eg st + C 2 Sg st h st + C 3 Sg st h st r st 


for the whole region. 
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In the present context, there is an additional cost of taking samples 

for estimating P and this can be expressed in the form of 
m 

C' = Cg ^ n i . Thus the overall cost involved is given by 

1=1 C" = C' + C . 

a 

Next, the mean square errors, MS E { p k } , k=l,2 m, are obtained 

from (8) after making substitution from (9) and (10). Now, if the cost 
is fixed, say C" < Cg, a determination of sample sizes, n^'s, 9 S {-' S > 
g s t's, h st 's and r^'s, can be achieved by solving equations obtained 

A 

by equating the partial derivatives of MSETp^ } + a(C"-Cq), k=l,2,...,m' 
and x a Lagrange multiplier, with respect to n^ 's, g^'s, ' s and r ^.'s 

p p 

to zero. Similarly, for any fixed values, say cr£, for MS E { p ^ } , k=l ,2, . . . ,m, 
this can again be achieved by considering the function 

A 

2 

C" + X^( MSE{ p^ } - a^), k=l,2,...,m, for minimization. Of course, this 

procedure may lead to k different values for various sample sizes. For 

a unique determination, take the largest value in each case. 

It may be noted that under this procedure, it is not possible to 

give any closed form expression for any sample size and its carrying 

out would involve some optimization technique. 

One direct way to simplify the problem is to treat the two types 

of cost separately. For example, if we consider only the cost C and 

variances given in (9) for the purpose of determining g ^'s, h^'s and 

r s j.'s, the problem is greatly simplified. Denoting 

G st 2 

(G st -1) S st 1 ~ ^ ^ e st ij ~ e st i ^ 

j=l 
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3 st vn st" u °st 2 


and 


G st H st (R st~ 1)S st 3 
(9) may be written as 
Var(e 


G st 

H st 

2 

E 

j=i 

E 

h=l 

(e st ijh" e st i 

R st 

H st 

G st 2 

E 

E 

E ( e st i jhk** e s t i j h ^ 

k=l 

h=l 

j=l 


; i>-E *[£ 

s,t L 


1 


( S st r S st 2 /H st^ + g st h st (s st 2“ S st 3 /R st^ 


+ S st 3 /g st h st r st “ S st 1 /G stl ‘ 


■] 


which is a function of g t , 9 st h st and 9 st h st r st as is the case with 

A 

the above cost function. Thus, the quantity Var(e.j) +X(C-Cq) or 

A O A 

C + A.j(Var(e.j) - and equivalently Var{e i - ) for fixed C or C for 
fixed Var (e^), is minimized when 


g st ~ w st y ( S st r S st 2 /H st^ AC l 


9 st h st w st J ^ S st 2 " S st 3 /R st^ /AC 2 


9 s t^ s t ^ s t w st 




S st3 AC 3 • 


r st S st 3 


vv 4 


°3^ S st 2' S st 3 /R stl 


St 


C l (S st 2-4 3/R st )/C 2 (S R t ,-S R t 2 /H st ) 


Accordingly, 



and g st <* w $t ^ (S^ S 5 t2 /H s t^ /C l * 

A 

These values are, of course, for the case of Var(e^). Similarly, one 
can obtain sample sizes corresponding to other variances. By choosing 
the maximum of these values in different cases, a unique solution can 
be obtained. 


6. FURTHER COMMENTS 


When stratification is based upon factors of latitude and longitude, 

one might expect different P's over different strata. However, if proper 

adjustment can be made in the classification algorithm wi th respect to 

spectral variation so that P remains the same, there is no need to make 

any change in the above procedure. On the other hand, one should find 

both the actual proportion estimate and its mean square error separately 

for each stratum and then by combining these one is able to obtain so 
* 

called p and its mean square error. In this situation the formula in 
(8) is still valid but stratumwise. 

Our discussion given in Section 1-3 is quite general and can be 
specialized and different sampling schemes can be adopted. For example, 
if our inference set is only one scene, then proportion estimates and 
their mean square errors are again obtained as discussed in Section 1-3, 
but as to sampling scheme there may not be any need of stratification 
and the sampling is done in two stages rather than in three stages. 
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ABSTRACT 


Observations may be known to be coming from m+l normal 
populations having same dispersion matrix; but one may be 
interested in identifying observations from only. Then 
the usual practice is to merge the other populations 7Tj,...,ir 
into one single population it and assume it to be normal. This 
is done in order to achieve the computat i ona 1 convenience of 
two population classification. This paper investigates the 
effectiveness of such practice. It has been found that in 
some situations this practicle may decrease the proper classi- 
fication probability of observations from considerably. 


Key V/ords 

Dispersion matrix, Bayes' procedure, mi sclass i f i ca t ion 
probability, proper classification probability, prior proba- 
bility, mi sc lass i f i cat ion cost. 



1 . Introduction 


Let us suppose that all of our observations come from (m+1) p-variate 
normal populations iTqjTTj ytr^t • • • w i th densities N (y. ,V) (i=0, l ,2, . . . ,m) , 

where yT = (y j ] >1 J j 2’ • • • » P j p) * s a vector of p means of the populations it. 
and V is the dispersion matrix, same for all populations. In general the 
problem of classification into more than two classes (rrH- 1 >p> 1 ) is more 
complex and computationally inconvenient than the problem of classification 
into two classes. In some situations, instead of finding which population 
an observation has come from, one may be interested in finding only if the 
observation has come from some particular population, say rr^. There, for 
achieving the convenience and simplicity of two-population classification 
problerrijone may think of merging the populations tt. .tt^, . . . ,ir m into one 
single population it and further assume the resulting single population tt to 
be normal. Then the question that needs to be answered is how badly the 
classification procedure based on such assumptions affects the miscl ass i f i~ 
cation probabilities. This paper endeavors to answer this question. 

In Bayes’ procedure the prior probabilities play a significant role. 
Improper choice of prior probabilities affects the mi sclass i f ication proba- 
bilities. The usual practice in remote sensing data analysis is to assume 
the prior probabilities q^ and q of y ^ and y to be equal, even though it has 

been assumed that the prior probabilities q^.q^ q m of tTq , tt j , . . . , 7T^ are 

equal. Therefore in answering the question raised in the above paragraph 
we should take into consideration our assumption about the prior probabilities 

Thus our study will be concerned with combination of following sets of 


assumptions . 



Assumptions regarding population densities 


(1) Before merger the populations tt. , <r ...... tt have densities 

U \ m 

N p (Uj,V), i =0 , 1 , . . . , m ; 

(2) After merger, 7 Tq has density H p (pQ,V) and the density of ^ 

a linear combination of the densities N (y.,V), i=l m; 

(3) After merger 7r has density N (y Q ,V) and tt has a density 

u p u 

Np(y,V') , where y and V 1 will be specified later. 
Assumptions regarding prior probabilities 


(a) q Q = 1 / ( m+ 1 ) , = q 2 = ... = q m = l/(m+l); 

(b) q Q = 1/2 , q 1 = q 2 = . . . = q^ = l/2m . 

2. Dispersion Matrix of the Population tt 

The true density p(x) of the population tt is given by 
P(x) = [q j / ( T ~q 0 ) 3 N p (y.,V). 

Now- since in both assumptions (a) and (b) we have 


and q./(l-q 0 ) =* q | / ( 1 -q Q ) = 1/m, 

then the density p(x) under (a) or (b) is given by 

m 

P (x) = (1/m) £._j N p (y. ,V) . 

Therefore the mean y of tt is given by 

y = E (X) = / x p (x) dx = -J- J? . / x N (y.,V) dx 

R m i i R p i 

P P 

= — (y, + . . . + y ) = y , 

m l K m 


and the dispersion matrix V' of 71 is given by 
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V' = [ (X-y) (X~y) T ] 

i m -p 

= — l- , / (x-y) (x-y) N (y.,V) dx 

m fc i=l R P i 

P 

i m -p *T- — x 

= — l. , / {(x-y.)(x-y.) r +2(y.-y)(x-y.)+(y.-y)(y.-y)}N(y.,V)dx 

m i — I R i l t i 1 1 P 1 

P 


" W m (X,-W) ( W,-»*> T - 

Here E^(-) denotes expectation of the population 7r. 

Therefore ; if we want to use a normal density function for ir instead 
of the density p(x) given by (2), we have to use N (y,V'), where V' is 
g i ven by (*0 . 


(4) 


Expression for (V 1 ) 

Let us wri te = V and 

V. = V._j +~(y . -"y ) ( y j - "u) T . ( i = l , 2 m) . (5) 

Then V = V*. We know (Rao, 1965; P- 29) that if A is a nonsingular p x p 
m 

matrix and a and 0 are two p x 1 vectors, then 




n J~ ] -1 A 'a6 T A 1 

(A + a0 ) = A T * i 

(6) 



1+0 A a 


Thus , 

v '! 

= V" 1 — ~V 1 (y j-y) (y 1 -]T) T V V-{ ) + (y ,-y) T V 1 (y ] -y 2 )} > 

(7) 

and 

v : 1 

1 

= v . 2 (y.-y) (y.-p) T v j_|/0 + (y.-y) T V.^y.-y)} , 


( i =2 , . . 

. . , m) . 

Therefore 




(V 1 )" 1 = V _1 - M (say) . 

(8) 


The exact expression for M that can be found recursively from (7) is not a 


simple one. 



3. Acceptance Region For Assumptions (1) and (2) 

We will denote the acceptance region of 7r^ given by a Bayes 1 classifi- 
cation procedure by B. B^, ^ or exam P^ e > w i denote the region B obtained 
under the assumptions (2) and (a). If the populations ir^ , tt j , . . . ,77^ have 
densities p (x) , pj (x) , . . . , p (x) and prior probabilities qQ,q] f *..,q m and 
C (tt j | tt , ) , (jVi ) , denote the cost of misclassi fyi ng an observation from tt. 
into TTj , then it is well known (Anderson, 1958; p. 1 43) that the acceptance 

region B for tt^ is given according to Bayes 1 procedure by 
m m 

B=fl . T { X : X . q . p . (X) C (tt. 1 tt. ) <7 . , ,q . p, (X) C (tt . I tt. )+q p (X) C (tt. | tt ) }. 
j = I ^i = Pi r i v ' ' G 1 i L \t j i j j 1 i M o r o j 1 0 

In case of assumption (1) if we assume all costs of mi sc 1 ass i f i cat i on to 

be equal, then we have 

m 

B i ■ n j=i {x; V» u =• Vj (x,) 

= {x: q„p n (x) > max [q.p.(*)]} . 

l<J<m J J 

Therefore, B. = {x: p n (x) > max p.(x)} 

19 • 1 < j <m J 


and 


B | b = {x: Pg (x) > (1/m) max p.(x)} 

l<j<m J 


In case of assumption (2) we assume that 


and 

Then we have 


CUgllTj ) = C( TT . f TT 0 ) = C ( TTj | TFg) , for all i 4 j J^O 

C (tt . | TTj ) = 0, for all i , j^O . 


B 2 = (x: q Q P 0 (x) > qjPjfx) + ... + q m P m (x)} 


Therefore, 


2a 


= Tx: q Q P 0 (x) > 0-q Q ) Z j =1 [q j/ C i -q Q ) T p t (x) } . 

= (x: Pq(x) > p ^ ( x) + ... + P m (x) } 


B 2 b = ^ x: P 0 (x) * + ••• + P m (x)]} . 


and 
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. Now s I nee max {p^ (X) , . . . ,p m (x) } (1 /m) [p j (x) + . . .+P m (x) ] , the following 

set theoretic inequalities are evident. 

B 1b= >B 2b^ B la^ B 2a * 


( 16 ) 


When p.(x) = N (y. ,V) , (i-0, 1 m) , then we know (Anderson, 1958; 1^7) 

i pi 


that B j can be expressed as follows 


Ml 

B = 0 . {x: W. (x) > 1 og (q . qJ } 

1 J=1 J “ J 0 

= (x: W.(x) > log(q.jq 0 ), for , 

J 


where W^(x) = (yQ - yj)"*"v [x - (1/2) (y^ + Pj)] 


Therefore 

and 


B 


= {x: W. (x) >_ 0, for j = l , . . . ,m} 


la j 

B,. = (x: W. (x) - log m, for j = l m} . 


(18) 

( 19 ) 


'lb 


j' ' - 


When the means y ] ,y 2> ...,y m are collinear, without loss of generality we 
can assume the populations to be two dimensional and assume the means to be 


as fol lows 


y o = 


r a n 


a > 

0 

’ U i = 

i 

i — 

cr 

1°. 

0 



(i = 1,2 m) 


( 20 ) 


In case of three populations, we can take 


— 


'a' 


a 

0 

, y, - 

0 

j and ^ 2 - 

0 

b* 


_ o_ 


— — 


— - — 


Let the 2x2 disperion matrix V be given by V - [v „ ] . Then we have 
Wj - ( 1/ | V | ) [x r (l/2)(a 0 -a)][(a 0 +a) v 22 - b Q v ]2 ] 

+ ( 1/ | V | ) [x 2 - (l/2)b 0 l[b o v n - v 12 (a 0 +a)] 


(22) 



and W 2 = ( 1/ | V i ) [x { - ( 1/2) (a Q +a) ] [(a Q -a) v 2 2 -b 0 V 1 2 ^ 

+ (1/1 Vf) [x 2 - (l/2)b 0 ][b Q v n - v 12 (a Q -a)] . 

The acceptance region B^ for will be bounded by two straight lines, viz, 
Wj“d and W 2 =d f where d=0 in case (a) and d= -log 2 in case (b) . The point of 
intersection of these two lines will be given by 

(v 12 C/D, v 22 C/D), 

where C = [{(a^ - a 2 ) v 22 - 2a Q b 0 v ]2 + b^v^} + 2d[v|] 

and D = 2b 0 (vj ~ = 2b cJ V i » 

j V f denoting the determinant of V. 

From (14) and (15) it follows that the acceptance region B 2 and 
can be given the following general description. 

B 2 = * x: P o (x) > + • • + P m (x)]) f 

where a should be taken as 1 for B„ and as m for B„, . In case of three 

2 a 2b 

populations with means given by (21) we may note that the condition 
ap Q (x) > p ] (x) + p 2 (x) 
is equivalent to 

log [a P o (x)/p ] (x)] > log [p 2 (x)/p ] (x) + 1]. 

Noting that we can wr ' te the above condition as 

log a + (p 2% ) T V 5 [x+(y 2 ~p o )/2] > log [1 + exp (2p 2 V *x)] (25) 

Here a should be taken as 1 for B_ and as 2 for B_,. 

2a 2b 

When all means are col linear, without loss of generality we can assume 

the populations to be univariate and u,<u n <..,<u . V in this case is the 

12m 

common variance of the univariate populations. Now from (18) we have 



( 26 ) 


It now follows from (19) that if y^< y^ , then 


B i 3 = {x: x <(y Q +Uj)/2} , 


if Hj<ji 0 <y J + | , then 


B ] 3 = x:{(y 0 +y .)/2 < x < (y 0 +y. +] )/ 2 ) 


(27) 


and i f y-> y , then 
0 m 


B la = {x: x > ( V p m )/2} 


( 28 ) 


The positions of the end points of the interval depend on Vlogm and the 

, y . B„ can not be given a general description like B, . 
m Za ta 


va 1 ues of Pq , y j » 


k. Acceptance Region for ir 0 : Assumption (3) 

When m populations tt j tt have been merged to form the population ir, 

according to assumption (a) the prior probabilities and q of Hq and ir are 
then given by q Q = l/(m+l) and q = m/(m+l) and according to assumption (b) 

they are given by q^ = 1/2 and q = 1/2. When ir is assumed to have density 
Np(y,V'), the acceptance region B of ir^ will be given according to Bayes 1 
procedure by 

= {x: N p (y Q ,V) > mN p (y,V')} 

= {x: - (x-y) T (V 1 ) ] (x-y) + (x-y Q ) T V Vx-pg) 


< log( jV' I / { V|) - 2 1 ogm} . 


Using ( 8 ) we can write 


B^ = {x: Q(x) < 1 og ( | V 1 | / [ V | ) - 2 logm} , 

where Q(x) = (x-y) T M(x- y) - 2(y Q -y) T V 1 [x- (y Q +y) /2] . 
We can similarly have 

B 3b = (x: d(x) < log ( | V ' | / [ V | ) > . 


(29) 

(30) 

(31) 


Obviously B 3b => B^. 
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Special Case 

Let us suppose that the populations are two dimensional, their means are 
given by (20) subject to the condition p = 0 and their common dispersion 
matrix is given by V - diag {v ^ , v 2 ). Then from (4) we have 

m T T 

V 1 - V + £ p. p./m = diag {v^ + £p. p./m, v 2 >. (32) 

i = l 

Therefore (33) 


IV'l/lVl = 1 + X p . p . /mv ^ 


Writing 


J = [( lab m) h 0], 

we can express V' as 

V' = V + wtJ. 

From (6) and (8) we now have 

M = (m.j) « V ^ V ^/{l+w^V 
Obviously M is a matrix such that 

rtijj = Za?/{v 1 (mv 1 + a?)} 

an d m 1 2 " m 2 | m 22 ™ ^ * 

Therefore, from (20) and (30) we have 

Q(X) = nijjXj 2 - 2p^ V ] x + pj V l p o 

■ "ll*! 2 - 2 Vl /v l - 2b oV V 2 + (a o 2/v l + b o 2/v 2> 

From (29) and (31) it now follows that the regions 
B^ s (s = a, b) are bounded by 

{xj — ( a 0 / v i m ] ] ) ^ = 2(b Q /m 1 jV^) {x 2 — A+ (v 2 /b o ) log a} 

(34) 

where a should be taken as m for B, and as 1 for B_, and ' 

3a 3b 

A - b Q ' 2 + a o 2 v 2 (m n v ] -l)/{2b m n ) - (v 2 /2b Q ) log (IV'l/lVl) (35) 

o 

A boundary described by (34) is a parabola with vertex at the point 


(a / (vjrrijj), A - (v 2 /b Q ) log a ) 


(36) 
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and symmetric about the point 


x i a o /v i m n 


When there are three populations with means given by (21) and 
V = diag {v ] , v 2 >, we then have the regions (s=l ,2) bounded by a para- 
bola (xj-f) 2 = **k(x 2 - g ) > where 
f = a Q ( 1+a 2 ) /a 2 vj 

k = b (l+a 2 )/2 Vo a 2 (3* 

o z 

and g = b o /2 + a 0 2 2 (a 2 Vj - 1-a 2 ) / (2a 2 b Q ) 


- (v 2 /2b o ){ log ( 1 +a / V j ) -2 log a} 
a being equal to 2 for B^ a and to 1 for B^. 

5* Classification Probabilities 

The acceptance region of n being given by B. s ( i = 1 ,2,3 and s = a,b) 

under different assumptions, the true probability of m i scl ass i f icat ion of 

an element from the 'other' population JT into II is given by 

m 0 

P(B. s |n) = I[qj/(l-q o )] / N p ( Pj , V) 


and that of an element of n into 31 by 

o ' 

P(B? |n ) = / N (p , V) , {hi 

is 1 0 p o v 

B? 

I s 

where A C denotes the complement of a set A. As the regions B„ , B„, , 

■ 2a* 2b’ 

B^ a or B^ b are bounded by conicoids, the integrals in (39) and (40) can 
not be evaluated in closed form. 

In special cases, when the populations are two dimensional and the 
common dispersion matrix is diagonal, the boundaries of B^ a and are 
parabolas. The integrals in (39) and (AO) cannot be evaluated in closed 
form even in this case. Therefore, it is not possible to state precisely 



in terms of misclassif i cat ion probability how much we have to pay if we 
want to reduce our many population classification problem into a two 
population classification problem and which one of the alternative assump- 
tions lb, 2 a, 2 b, 3 a and 3 b is preferable next to the true assumption la. 

in special cases, it may be possible to make some inference looking 
at the graphs. 

Examples 

In the following examples we consider three populations with densities 


» 1 > * 

j WIICI t 

a 


-a 


a 

”1 0 

* ^0 = 

0 

L 

, V, - 


and ~ 


0 A j 
J 


D 

L a 


0 




V = 


a, a , b being specified separately in each example. All the populations 
have the same prior probability. 

Example 1 . Let a=l , a Q =3 and b Q =6. Then we have 

B ^ “ {(x lY ): 8x + 3y - 17 >0, Ax + 3y ~ 17 ^0}, 

B lb = { ( x ] y) : 8x + 3y - 1 5 • b2 >0, Ax + 3y - 15. b 2 >.0}, 

= { (x^y) : 3y 2 log [1 + exp (2x)] - 8x + 17) > 

B 2b = ^ ( x | y ) : 3y >. 2 109 [1 + ex P ( 2x ) " 8x + 1 5.b2} , 


3a 


= {(x.y) : (x-b) Z < b(y-0. 23) ) , 


B 3b = { ( x l y): (x " b) - b(y + °- 2 3)>- 

The classification procedure defining the partition B^ a ) is optimal 

in the sense that it has the minimum expected cost of mi scl ass I f icat i on 

(which in this case is equal to the total mi scl assif i cat ion probability). 

From the graph it is evident that the next best partition is ( 823 * * 

c c 

then (B„, , B„, ) , then (B,, , B,,). The two classification procedures based 
2b 2b lb lb 


on the assumption of normality of the mixture perform relatively poorly. 



0 \$ 



FIGURE I 



Of the two, the partition (B^, B^) ^ as ^ ower misclassif ication probabi- 
lity than the partition ( B ^ ^ , B^) . 

Example 2 . Let a=2, a Q =0 and b Q =8. Then we have 

B l a = { ( x i y ) : x+y- 3 1 0, y - x - 3 > 0}, 

- { (x^y) : x + y - 2.66 >_ 0, y - x - 2.66 >_ 0}, 

B 2a = ^ X 1 Y ^ : 2y — log f 1 + exp ~ 2x + 6}, 

B 2b “ ^ x i Y ^ : 2 y 1 lo 9 H + e xp (**x)] - 2x + 5- 3D, 

B 3a = {(x i y) : x2 < 5(y - 3-9*0} 

B 3b = { (x^y) : x 2 < 5(y - 3-6)}. 

The classification procedure defining the partition ( B j a > B | a ) is optimal 
and from the graph we may infer that the next best is the one with partition 
(B^ , ) • Unfortunately, in this case, no comparison of other procedures 

can be made even from the graph. 
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ABSTRACT 

When vc have a single snnple of observa- 
tions iron tvo normal populations with sane co- 
variance matrix, but no training sanple from 
either population, then for classifying future 
observations, the usual practice is to. cluster . 
the sample into two nearest neighbor clusters 
and design a Bayes 1 classifier treating the 
tvo clusters as two training samples* The use 
of £]-distance is often advocated for such 
clustering* It has been shown in this paper 
that such advocation is not always reasonable* 


1 * 

Y7c suppose that our data have been obtained 
through remote sensing devices from tvo p-vnnnte 
normal populations TI i and II £ vith densities 
K (v^t) and K (p 2 i^) respectively, the vectors 
pj and V2 of means and the common dispersion ma- 
trix £ being unknown, and that vc do not have any 
training sample, that is, vc do not know the act- 
ual source of our observations; but tnc data is 
assumed to be such that tvo and only two distinct 
tv>des can be determined from them. For classi- 
fying future observations into these tvo popula- 
tions on the basis of such data, the usual pia«- 
ticc is to cluster the data Into U:o nearest 
neighbor clusters Cj and C ? using irstrlc or dis- 
tance function defined by f.\- t or ^»~ noi!n * 

and then design A Layer. ’ cln.-.s 1 f J er vit!i the 
assumption that the population densities are 
N (fl , JC )(J-1 vbrve and ^ nro respec- 
tively iSu: laurpJe mean and the sample dispersion 
tnatrJx of the sample 0 ^ a.'nnivvd to be con* :J filing 
f ,f observations fvM:i the population alone. H 


lc Is known that the parent populations have the 
sa*e. dispersion matrix. then in designing the 
classifier, the matrices Ei and E 2 shove are re- 

A 

placed by I, where 

1 * [(Ky-DEi + (S 2 -l)^l/.(h'l + ”2-2), CD 

H .denoting the number of elements of C^. A 
classification procedure defined in this fashion 
will be referred to as clnster hnserf classifies 
tion procedure . 

Let P denote the Bayes* procedure based on 
complete knowledge of N p (p..,r) (j =1 . 2 > and P i the 
cluster based procedure when L^-noro (i=l,2,«>) 
has been used for clustering and the sample sises 
are very' large. Our objective in this paper is. 
to compare the tdsclasnificatioti probabilities of 
these classification procedures. Ve have made 
the comparison a little simpler by taking p=2, 
vr 0 and l *= I, that is, by assuming our popula- 
tion densities to be U 2 (0,I) and J! 2 (V 2 >D- In 
designing the classifiers we have also assumed 
that both populations have equal prior probabil- 
ities (1/2 each) and equal tr.is classification 
costs. Kow , without any further loss of gener- 
ality, it can be assumed that p 2 a la,h] and 
a>_0 , b>0 . 

2. Classification Procedure P Q 

Under the set of assumptions made above, the 
Bayes* classifier is given by 

Vi (2) c ax + by - (a^+b^)2/ ,2 = [x»y] L- 

and hence the acceptance region- for Til is 
given by 

B • t ,y> : a>: + by < (a +b )/?)« '• Jl 

If P(j[i) denotes the probability of n; sclassi 
fylng an element from ir.to then for tbi- 

procedure vc h:*\ p c * 

P(?JU P(1 t 2 ) " <'(-/a 2 lb 2 /2), 

where * __ 

<,(() - y exp (-x^/2.) 



If a=l and b-2 , then ve have from W 

r C2 1 1>. “ r < 1 1 2 ) » v(-7s/2) ■- 0.13. 

3. Clus ter Base d Classifi cati on 

It is veil known (Lcrcan, 1970) that as the. 
sanple sizes increase, the clusters and C^. 

based on 1 *iwm converses with probability one to 
the partition (S^S^) of R 2 , the two dimensional 
real vector space, vljere 

{r = (x,y) • I U1 ! t < I 

(5) 


Si - S li 


A C denotes the complement of the set A and 
||u-v||. denotes the distance between the points u 
and v of R 2 according to i^-norm. j3 j ± and the 

mean and dispersion matrix of the sample C^, 

can be viewed as the sample mean and san- 
ple dispersion nntrix of a population vi th density- 
function p^ i (z) given by 

p (z) = (l/2) {X 2 (0,1) + N' 2 (t'2.I)]> z c S ji^ 6 ^ 

' ji 

= 0 otherwise. 

Thus, by the lav of large numbers, with probaMl 

>e 

(7) 
(S) 
(9) 
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leg to the Hayes’ classification procedure in 
which the population densities for By and H 2 have 
been inadvertently assumed to be M (m,V) and N 
(n’,V) respectively. Then 

E = {Z = (x,y): W(x,y) 0), (12) 

where 2 2 

V(x,y) = 2C ei x+g 2 y) + K + 2exy-djx -d 2 y , (13) 

di “ vf 2 /IV'| - 
£ 2 - Vjy/fv'l - v n /|v|, 
e = viz/lv'l - v 12 /lv[, 

El = (vi 2 ni-vj 2 m|)/|v'| - (v 22 m r v i2 in 2 >/ ! v[ , 

' g2 = (vf jci£-v{ 2 mf) / 1 V’| - ( V 11 B 2 V) 2 ni)/lvi, 


-' T V'‘** 1 » > - log( | V' | / ! V | 
[v'.] ( m T = [mi,m. 2 ] and 


Ity one (L ± converges to and 1^. to I J1> where 

^ | ji ZP :(i Cz)dz/P(S jl ) 

' (z-p 1< )(z-l>m) T P it ( ! ’^ 2/r(S iP 




ji 8 ^" 'n- ‘3i **■ 

and P(S^) - { PjiW-*- 

In designing cluster bas.cd classifiers, as we 
have mentioned earlier, ve shall consider two 
cases— when ve do not know that the dispersion ma- 
trices of the population fly and H 2 are equal and 
when ve know that they are equal. In the first 
case, ve shall denote the cluster based procedure 
by P (based on J-.-nom) and in the second case ty 


Thus P' is the Hayes’ procedure based on the 

densities K ? (p ,P ) (3 = 1.2) . where 

J - (10) 


h m v ^ii n u + n hi )r - 2 i- 

Ki s c In r. s i f i cn t i_n ii_V r y V :\b 1 1 1 1. v_ 


If a clarification procedure defines a set £ 
a?; the. acceptance region for the population ilj 
with true density l.*p (0.1) and U C ar. the acceptance 
region for II 2 with true density U;.(n 2 ,T), then the 
nisclar.s tfication probabilities of' this procedure 
arc given by 

l‘(2 1 1 ) “ l > (li r |i |*)- “ 1 h’ 2 (f>,k)<la. 


nud V (1 1 2) “ P(ii|li 2 )' 


B- 


(14) 

(15) 

But 


T -1 

K = u V m - m 
Here V = [v^], 

= [nj,n 2 ]. 

The rdsclassification probabilities P(2|l) 
and P (1 * 2) can now be expressed in the form 
r(2|l) - P(V(X,Y) >0 |z e III) 
and P(l[2) = P(V.'(X,Y) <0 \Z c H 2 ) 

If V^V", then the exact distribution of 
V(X,Y) when Z e Hi or Z e n 2 is intractable, 
if dj, d 2 and c are negligible compared to gy and 
£2, then in (14) and (15), the contribution from 
the second degree terms in X and Y will be negli- 
gible compared to that from K + 2(gyXlg 2 Y). Thus 
ve can perhaps approximate the ^distribution of 
y(>;,Y) by the distribution of V?(K,Y) “ K + 

2 (gyX-5-g 2 Y) . But if V-V', then W -W. Kov W can be 
shov-n tote distributed normally with mean 
F.< = K, when Z-vI^fO,!) 
ar.d T» T = K + 2(gya+g 2 b), 


when ZvK 2 (ii 2 ,I) ,U 2 = [a*hl 


and variance 


2 2 

Yar b * 4(gy +g 2 ) 


( 16 ) 

(17) 


in either case. Tlie approximate values of the 
nisei ass if ication probabilities P(2[l) and P(l|2) 

car, be given by __ j~— 

p(2 [ 1) = P(C>0[2 c By) — l-fr(-K/2 < /ci +£2 > 


and F ( 1 ] 2 ) - r('.:<ojh r. H ? ) ° 
<-(-{1 (El -vl-gib) -tK}/2s/gy~ 


+E2 2 ) . 


(lb) 

(19) 


DO 


4 « C K"; ^ t". 1 f 5 c i \ t l_on Vr£^Ci'_d] | [H" P 2 ^ 2 

Tiic. L^-norn induced distances between Cw*o 
points ti - (uy,u 2 ) T and v - tvy.v-l^ if. given by 
||u-v || 2 » (uj-vj ) 2 + (uj-vy;)") 1 ^ . 

Since » [a ,b] , tJien it follows from (5) that 
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Limit ins clusters using 1^, 1 2 ” 
and 1 - nor a 

es> 
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the limiting; clusters 5)2 ^22 ore given by 

$ 22 c * S j 2 “ {(x.y)s ax-iby < (a 2 -H> 2 )/2>. 
Evidently Sj 2 =U o , as given l>y (3). h’ow it fol- 
lows from (6)- (9) that 

p(s 12 ) - r(s 22 ) - 1/2. (20} 


hi 

- k^J cnd 


T - 

ft + a 2 k 1 (l-k i ) abk 1 (l-k ± > 

, (21) 

12 

jabk^(l-kj) 1 + b k^(l-k^)_ 

• 

(i=l*2) » 

where 


k X « 

= «(- a 2 +b 2 /2) - ( 2li(o 2 +b 2 )/2)" :l 



exp {-(a 2 4b 2 )/8) 

(22) 

and V 2 ' 

= <-( a 2 +b 2 /2) + ( 2!!(a 2 +b 2 )/2)“ 1 



exp {-(a 2 +b~)/S}. 

(23) 


The set B 2 . the acceptance region of H) 
according to Pa> can be obtained from (12) by 
letting tii-JJ ] 2 » “^22* 1 7= ^12 and an( * 

corresponding mis classification probabilities 


from (13) , (18) and (19) - 


An Exampl e 

Let a-1 and b-2. Then 

$ 22 C e S 12 =' {(x,y): x+2y<2.5} 

[figure (3)] 

and it follows from (21) , (22) and (23) that 

T 

kj = -0.06, k 2 - 1.06, p^2 = 



and IT 2 , the Bayes' classifier based on densities 

N (pi2,r l2 > at“l «a(’42Z.J:22) is usetl instoad of 
that based on the true densities N 2 (0,T) and 
N 2 ([1,2] T ,I) , which are unknown, Then the ap- 


proximate values of the resulting nisclav-sifica- 
tion probabilities are obtained from (IS) and 
(19) as 

P(l|2) « 0.22 and l'(2|l) = 0.078. 

In procedure P 2f the Bayes' classifier is 
based on densities I! 2 (lHJi8) and h 2 (u 22 ,l), where 


l « r(S U )Ei 2 + VCS 22 )E 22 


[ 0.935 -o.n?! 
1—0 . 3.3 5 0 . 765J * 


The nisclnsslfieation probabilities are now ob- 
tained fron (18) and (19) as 

r(l|2) = 0.18 and l’(2 |l) = 0.09. 

5. Class! t tea Lion I'rnc <'hi:o l’j and l'j ' 

The £ j-norn Induce*! distance bc*Lt;een the 
.points u and v is given hy 

| |u-v] [ ) ” |>i|-Vj | J u 2 -v 2 | • 

The lir.i t Jnj; cluster 8i)(”8;.) C ) consists of 
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points (x,y) satisfying the following conditions 
in three different cases. 

Case 1 . b>a>0 [figure (2) and (3)) 

(i) x>n, y<(b-a)/2 
(ii) 0<x<a, x+y<(a+b)/2 
(iil) x<0, y<(a+b)/2. 

In this case Sj j and S 2 i are distinct fron S 12 and 
S 22 given hy (3) . 

Case 2 . b>0, a=0 [figure (1)1 

<i) y<b/2. 

In tills case Sjj coincides with $i 2 an ^ ^21 co ^ n- 
cidcs with S 22 - 

Case 3 - b=a>0 [figure (4)] 

(i) x+y<a 

The tj-norra distances of the point (x»y) from 
(0,0) and (a,b) are equal when 
x+yXa ar.d x<0 
or x+y>a and y<0. 

These points have been assigned to S 21 In order 
that no point escape classi fication. In this case 
also the cluster S)} coincides with S ) 2 and S 2 i 
coincides with 

Thus the procedure PjfPl) differs from the 
procedure V 2 (^ 2 ) only in case 1. In this case* 
the expression for the components of the means pyj 
and V 2 I and the dispersion matrices l 2 l and S 21 
arc long and complicated egressions involving the 
probabilities o(a)> £-t(e+b)/2] and $[(b-a)/2]. 

For that reason ve have not included it here. As 
in the example in section 4 S if ve tahe a-1 and 


then we shall have 



Co. 006] 

_ n 

3.423] 

vn ** L o.iotJ > 

V 21 *= y 

yilj 

[5.93 -0.06] 

1 and ] 

"l.82 

ll1 = U 0 - 05 0-84- 

i Eal \ 

L"- 2 


Proa (13) it now follows that 

VI (a, Y) = 0.46X 2 + O'. 24Y 2 + 2(0. 52X + 2.46Y) 

+ O.OlXY -9.36 

The exact distribution of V is intractable. The 

nom.il approxi. ration to W given in section 3 is not 

2 

valid, since the coefficients 0.46 of X and 0.24 
2 

of V are not negligible compared to coefficients 
0.32 of X and 2.46 of Y. Using a very rough ap- 
prox J nation, it can be shown that in ill is case 
P(2[i) and P(]{2) are larger than those 

for . 

For procedure -Pf v:c have 

i\ -T(Sn)in + P(Sii )r-n 

0.495 r.] I + 0.505 E ? ) 
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From ( 13 ) wc now have 

W(X,Y) = 1.11SM + S.44Y - 6.541 

end from ( 18 ) and ( 19 ) vc have 

P < 2 1 1 ) = 0.18 and r(l( 2 ) = 0.12 

6 . Class I f lent ion Proc ed ure P^ and P^ 

The t -norm induced distance between the 
* «0 

points u and v is given by 

I J u ~ v l L = Dnx fl u l~ v ll* | u 2 *“V2 | ) * 

The points (x,y) in the limiting cluster 
Si (o (=S 2 <u ) satisfy the following conditions. 

Case 1 , b>a >0 [figure ( 2 ) and ( 3 )) 

(i) a-b/ 2 <x <n 3 X (h/ 2 ,a) and 

y<b/ 2 , 

(ii) 'x<a-b /2 and x+y<a 
(ill) x>max (b/ 2 ,a) and x 4 -y<b . 

Case 2 . b~a >0 [figure ( 4 )] 

(i) >Hy <a 

The l - distances of a point (x,y) from. ( 0 , 0 ) and 
(a,b) arc equal when 
xdy>_n and x <0 
or x^y>a and y< 0 . 

Such points have been assigned to $ 2 ro arbitrarily - 
Thus in this ease Sj^ and 82 ^ coincide with Sj 2 
and S 22 respectively. 

Case 3 . b> 0 , a -0 [figure ( 1 )) 

In this case, the performance of A^-norn in clus- 
tering is very poor. The £ - distances of every 
point (x,y) from ( 0 , 0 ) and ( 0 ,b) arc equal when- 
ever 

x-y-i 2<0 and xdy <0 
or, x+y~ 2>0 and >:-y >0 

However, v:e can arbitrarily define Sj p? as the set 
{(x.y): y<b/ 2 } . 

But S\ then becorr.es identical to $\\ or S} 2 * 

Thus the procedure P^(P^) differs from P 2 (^f) 
only in case 1 . But in this case, the exact nis- 
classifJ cation probability is hard to find. In 
the special case, when a-I and b= 2 , (figure ( 3 )) 
ve find that the boundary line of and can 
be obtained as the reflection of the boundary line 
of Sji and Sp \ about the boundary' line of Si 2 and 
S 22 * ’ Frer* thlji v« nay guess that, as in ease of 
the procedure Pj ami Pj, the total nisei ansi { Jea- 
t Jon prob.tMl tllefi of P^ and 1 *' will be larger 
Ilian the total vtJ.se lausi f J cat Jem probabilities of 


7. Cone 1 u s ion 

Our findings about the w Is classified tlon 
probabilities r(l|2) and P(2|l) for the procedures 
P , 1*1 • r l» p 2 and p 2 ln efie special case vlien a-1 
and b=2 can be seen at a glance from the following 
table . 


T able 



True 

Bayes 

ij-nom based 

t 2 -norm 

base 


P 

0 

pi p{ 

*2 

*2 ' 

pa!2) 

0.13 

. 0.12 

0.22 

0.18 

?(2|l) 

0.13 

0.18 

0.078 

0.09 

Total 

0.26 

* 0.30 

0.299 

0.27 

*Using 

rough 

approximation it has 

been found 

that 


the total msclassification probability oC Pj is 
greeter than 0,229. J 

Thus, taking nisclassi fication probability as 
the criterion for comparing the performance of the 
classifiers, ve find that in the example consid- 
ered above P 2 is better than Pf and P 2 is better 
than Pj. But, no such conclusion can be made in 
general. It is not known whether P 2 always has 
lower r.is classification probability than its com- 
petitor Pf or and P 2 has lower nis classifica- 
tion probability than or ? w - 

Anong all norn-induced distances, fj-dis-' 
tances arc the most easy to compute. When popula- 
tion means are known to be such that the straight 
line joining then is inclined at an angle of 45° 
to x-zixis (for example, vhen they are (0,0) and 
(a, a ), the classification procedure P ^ (P f ) and 
? 2 (F£) become identical. Only in that ease, ?i cr 
Pf is the optimal procedure, been vise wc then 
achieve the computational ease along with low mis- 
classif feat ion probabtli ty . 

The procedures P and P" arc not desirable. 

xn to 

If in the process of clustering, tvo cluster cen- 
ters are such that they have the some /-coordinate f 
then seme points arc to he clustered according to 
l 2 ”di stance. This arbitrariness In allocation of 
points to the clusters, makes P or T' undesirable. 
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1 . Introduction 

A classifier or discriminant function W(xjc) is a function or a set of 

/ 

functions which determine the membership of an observation vector x into one 
of several given sub-populations, c denoting the vector of parameters of the 
classifier. The classifier usually considered is one which has minimum 
expected cost of mi scl ass i f i cat ion among all classifiers. This optimal 
classifier is known as. Bayes' classifier [2]. This classifier is obtained 
on the assumption of complete knowledge of mi scl ass i f i cat ion costs, prior 
probabilities (the proportion of the given subpopulations in the over all 
population) and the probability densities of the given subpopulations. When 
the parametric family of the population densities are known, but the parameter 
themselves are unknown, then the usual practice is to replace the true values 
of these parameters by their respective estimates based on sets of past 
observations of known classification in the expressions giving the elements 
of c. A classifier W(x, c) , thus obtained, lacks in the above optimality 
property. A set of past observations known to be coming from a particular 
population is often referred to as a training sample. V/ hen the parametric 
family of the population densities are also unknown and sometimes the prior 
probabilities are also unknown, the methods by which a classifier W{x, c) is 
obtained from the available training samples of the populations are referred 
to as nonparametr i c methods of classifier design. An adaptive pattern recog- 
nition scheme is one such nonparametr ic method. 

An adaptive pattern recognition scheme is a nonparametr ic method in 
which a classifier is assumed to be of the form 
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W(x, c) = CjQ.j (x) + ... + c m (x ) = c T $>(x), (1) 

where the components (x) , . .., Q^(x) of the vector <$(x) are prechosen 
linearly independent continuous functions of the observation vector x and 
the unknown parameters [c j , . .., c^] = c T are directly estimated recursively 
as in [1, 7] or nonrecursively as in [6, 8] from the available training 
samples, instead of being obtained terms of the estimates of the probability 
densities. A classifier obtained in this fashion will be referred to as an 
adaptive classifier . 

The functional forms of the functions Q.(x)'s and their number m are 
often choosen arbitrarily. This arbitrariness in the choice of the func- 
tions Q. (x) 1 s may create problem. A good classifier should be optimal for 
discriminating among the given populations. For example, it is well known 
[2] that when there are two p-d imens ional normal populations with different 
covariance matrices, a linear classifier of the form 

W(x, c) - CjXj + ... + C p X p +C Q = C^Z, (2) 

T T T T 

where c = [C|, ..., c p , c q ] , x = [x^, ..., x ] and z — [x , 1], Is not 

optimal for discriminating between these two populations. Therefore, if the 
functional forms of Q,. (x)'s are not properly chosen, then the classifier 
W(x, c) given by (1) may not be optimal for discriminating among the given 
populations. Also in designing an adaptive classifier rarely any attention 
is paid to the optimality criterion, minimum expected cost of misclass if ica- 
tion. However, adaptive classifiers are easy to design and are very attrac- 
tive in fast classification for their simplicity. Besides, there are 
empirical cases in which they do no worse, if not better, than many other 
sample based classifiers [2]. 



A comprehensive survey of adaptive pattern recognition schemes can be 
found in and Agrawala [5] and in Tsypkin [ 9 ]. Our objective here is to 
present a unified theory of adaptive pattern recognition. Most of the adap- 
tive pattern recognition schemes have been devised for discriminating between 
two populations and few of these schemes have been generalized in a very 
straight forward manner for discriminating among more than two populations. 
For this reason, we shall also restrict ourselves mostly to the case of two 
population classification. 

2 . Basic Theory 

Let all the observations come from two p-dimensional population tTj and 
tt 2 . . Then the observations can be represented by points and the two training 
samples by two sets and of points in the p-dimensional real Euclidean 
space Ep. It will be evident later from the way an adaptive classifier is 
designed that the sets Tj and T should be disjoint. A real function f(x), 
f: Ep -* Ej, is sai'd to separate two disjoint sets Tj and in E , if there 
exists a constant t such that f(x) j> t for all x e Tj and f(x) <r t for all 
x e T . If there exists a function f(x) separating and T , then f(x) can 
be considered to be an ideal classifier which can be used to classify all of 
the points of Tj and correctly. it may not often be possible to construct 
such f(x), even if it exists. But, it may be possible to obtain a function 
W(x, c) as an approximation to the function f(x) and use it as a classifier 
instead of the ideal classifier f (x) . This is precisely the way an adaptive 
classifier is designed. 

The existence of function f(x) separating the training samples T^ and 
7 ^ will be evident soon. The following theorem is well known (Dunford and 
Schwartz, [^] , p. ^17). 



Theorem. In a topological vector space, any two disjoint convex sets. 


one of which has an interior point, can be separated by a nonzero continuous 
1 inear functional . 

If Tj and have disjoint convex hulls Cj and then we can use the 

above theorem to prove the existence of function f (x) of the form 

f (x) = c.x. + ... + c x + c 
11 p p o 

such that f(x) > 0 on Cj and f(x) < 0 on C^- In this case the training 
samples Tj and T ^ are said to be linearly separable. Therefore, when the 
training samples are linearly separable, then a linear function f(x) separat- 
ing Tj and T^ can be used as an ideal classifier. An adaptive classifier is 
then easily obtained as a linear function approximating this ideal classi- 
fier. 

Unfortunately, the sets Tj and T^ of training samples are rarely lin- 
early separable. Therefore, often there may not exist any linear function 
separating Tj and 'T . However, there exists a function f(x), not necessarily 
linear, which will separate Tj and T^ whenever the sets Cj and the 
smallest closed sets containing Tj and T^ respectively, satisfy the conditions 
of Uryshon's theorem, well known in Topology ([4], p. 2**). 

Uryshon's Theorem . If Cj and are disjoint subsets of a normal topo- 
logical space V, then there exists a continuous function g(x), g: V -*■ [0, 1], 
such that g(x) = 0 for al 1 x e and g(x) = 1 on Cj. 

Using the function g(x) of Uryshon's theorem, we can easily define a con- 
tinuous function f(x) on V such that (a) f (x) = tj on Cj and f(x) = t 2 on C 2> 
where tj and t^ are any two distinct preassigned numbers, or, (b) f(x) > 6 on 
Cj and f(x) < — 6 on C^, where 6 is a preassigned positive number. Now, since 
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it is well known that is a normal topological space under any nontrivial 
topology and Tj and are disjoint sets containing finite number of points, 
then we can always obtain two disjoint sets Cj and as the smallest closed 
sets containing Tj and T^. Therefore, as long as and are disjoint sets 
of training samples, there always exists a function f(x), linear or non- 
linear, which separates Tj and T^. 

As we have mentioned earlier, in an adaptive pattern recognition pro- 
cedure the ideal classifier f (x) is approximated by a function W{x, c) of 
the form (2), namely, a linear function of an observation x to obtain a 
linear adaptive classifier , or, by a function W(x, c) of the form (1), a 
linear combination of a finite number of known nonlinear continuous func- 
tions, to obtain a nonlinear adaptive classifier , the parameter vector c 
being selected in order to minimize certain given pay-off function. 

As the probability densities of the populations are unknown, expected 
mi scl ass i f i cat ion ' cost cannot be evaluated and therefore cannot be used as 
a pay-off function. Distances D(f, W) between the functions f (x) and 
W(x, c) or some functions of it defined by some meaningful metric can be 
used as geometrically and intuitively appealing pay-off functions. For 
example, using Euclidean metric we can define 

D 2 ( f , V/) = | f(x) — W(x, c)] 2 , (3) 

xeTjUTj 

as a pay-off function and select the parameter vector c in order to minimize 
D 2 (f, W) . In this case W(x, c) is a least square approximation to f(x). 

In order to assure the existence of a unique minimum of the pay-off function, 
a pay-off function of the form 



L (f , W) = 


w 


F (f (x) — W(x, c) )/n , 
xeT ] UT 2 

where n is the total number of points in and and in the fashion of 
decision theory F is a convex function interpreted as a loss function, can 
also be used [9]. We may note that the pay-off function (f> W) = 

D^(f, W)/n, often referred to as mean square error criterion [6, 8], is a 
particular case of (4). 

Let the subpopulation II. have prior probability q. and probability 
density function p,(x). Then the population which is a mixture of IIj and 
JI^ has probability density function q^Pj(x) + q^p^x). Therefore, the 
expected value of F(f(x) — W(X, c) ) will be given by 
R(f, W) = E F (f(x) - W(X, c)) 

2 

= q. / F(f(x) — W(x, c)) p.(x) dx (5) 

i = ] 

Now, if there are n. observations in T. and Oj + = n, then we can write 

2 

L(f, W) = (n./n) F(f(x) - W(x, c))/n. (6) 

i=l xeT. 

i 

and observe that L(f, W) is the sample mean and R{f, V/) is the population 
mean of F (f(x) — W(X, c)). 

It is important to note that there is some arbi trariness in the choice 
of the number m and the continuous functions Qj(x), ..., (^(x) , used in the 
definition of W(x, c) . 
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3 . Linear Classifier 

The linear classifiers are used in adaptive classification mainly due 
to following reasons. 

1. They are relatively simple to implement as electronic circuits. 

2. The circuits can be made adaptive very easily. 

3. The Bayes* classifier with minimum expected cost of mi scl assi fi ca- 
tion has a linear structure when the two populations have normal distribution 
with equal covariance matrices. 

3.1. Ni 1 sson 1 s Cl ass i f i er 

The algorithm proposed by Nilsson [7] iteratively determines the para- 
meter vector c of W(x, c) , as defined in (2), according to the following 
rule. Let x(i) denote the i th observation of known classification, that is, 
an element selected from T ^ UT 2 at the ith step of an iteration, z(i) the 
corresponding value of z, c(i, n) that of c at the ith step of the nth 
iteration and W.^ = c^(i, n) z(i). Then at the (i+l)th step of the nth 
iteration we have 

c(i+l, n) = c(i, n) if x(i) e T ] and W- n > 0 

= c(i, n) if x(i) e and W.^ < 0 

= c(i, n) + dz(i) if x(i) e T^ and W.^ <_ 0 

= c(i, n) — dz(i) if x(i) e T^ and >_0, 

where d is a positive number, so chosen that c^(i+l, n) z(i) can correctly 
classify x(i) for all i. The vector c(l, n+1) is taken as the value of c 
at the end of nth iteration. The process is continued until we have, for 

A 

some i, c(i, n) = c(j, n) = cU, n+1) = c (say) for all j>i and £<i. When 

✓V A 

such c exists, we shall say that c(i, n) has converged to c. 
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Unfortunately, if the sets T| and of training samples are not 
linearly separable, then the sequence c(i, n) may not converge. The algor- 

A 

ithm is effective, that is, c(i, n) converges to a c, only when and 
are linearly separable. It should be noted that c is not unique; it depends 
on the sequence in which we select the observations from Tj UT^. The num- 

A 

ber of iterations needed to reach such c also depends on this sequence. 

The classifier W(x, c) constructed in this fashion is not in general a 
Bayes classifier. 

Example . Let T ] = {[0, 1] T , [1, 1] T }, T £ = {[0, 0] T , [1, 0] T }. Then 
choosing d = 1 and the sequence of observations as in the Table we can 
obtain Nilsson's classifier in the following way. 


n 

i 

c 

Z 

w. 

1 
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C 1 

C 2 

c 

o 

x i 

X 2 

1 
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0 

0 
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2 

l 

1 

l 

l 

0 

1 
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Yes 
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1 

0 

0 

1 

1 

1 

No 
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0 

1 

0 

0 

0 

1 

0 

Yes 
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0 

1 

-1 

1 

1 

1 

0 

Yes 

2. 

2 

1 

2 

0 

1 

0 

1 

1 

Yes 
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0 

2 

-1 

0 

1 

1 

1 

No 


4 

0 

2 

-1 

0 

0 

1 

-1 

No 
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2 

-1 

1 

1 

1 

1 

No 

3 

2 

i o 

2 

-1 

1 

0 

1 

-1 

No 


a t 

Therefore c = [0, 2, -1] and the classifier is 2x^-1. A decision rule 
based on this classifier is to assign x to 11^ if 2x^-1 > 0 and to H if 
2x 2 -1 < 0. 




where 


l - dE and c q = -{£T(Njnij+N 2 m 2 ) — (N |-N 2 ) [/(N^+l^ » 

d - 2/ [ (N 1 +N 2 ) 3 /{N 1 N 2 (N 1 +N 2 -2) } + (m,-!^)] (10) 

and Z = (N ] S 1 +N 2 S 2 )/(N | +N 2 -2). (11) 

When Nj = N 2> then we have 

c = (m.-rO^I ' (m. +m ) [Nd/4 (n- 1 ) ] (12) 

0 12 12 

The classifier W(x, c) = c^z in this case resembles the Bayes' classifier 
with parameters of the population densities replaced by their respective 
estimates, that is, Anderson's [ 2 ] sample based discriminant function for 
two normal population with equal covariance matrices. 

The classifier attributed to Agmon and Mays [ 5 J is obtained when 

L(f , W) = £ [ |W(x, c) - f (x) [ - {W(x, c) — f (x) }] 2 

1 xeTj UT 2 

is used as the pay-off function. 

4 . Nonlinear Classifier 

The nonlinear classifiers are used in adaptive classification mainly 
due to following reasons. 

1. If the sets Tj and T 2 of training samples are not linearly separ- 
able, then a function f(x) that can be used to separate them are often 
nonl i near. 

2. The Bayes classifier has nonlinear structure when the two popula- 
tions have (a) multivariate normal distributions with unequal covariance 
matrices, or (b) nonnormal multivariate distributions. 
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3 . 2 Other Linear Classifiers 

An adaptive linear classifier can be obtained as linear approximation 
to a function f{x) separating Tj and T 2 and taking the value 1 on T j and 
-1 on T 2 , the parameter vector c being selected as a vector c that minimize 
a given pay-off function. Different pay-off functions can be used to obtain 
different classifiers. 

Koford-Groner Classifier 

An adaptive linear classifier W(x, c) may be obtained by selecting a 
vector c as the parameter vector that minimize the mean square error cri- 
terion D 2 (f, W) , or equivalently the squared Euclidean distance D 2 (f , W) 
given by (3), using gradient method or any other suitable optimization 
technique. An iterative algorithm based on gradient method for obtaining 
c can be given by 

c(n+l) = c(n)+(>zE {f (x) — c T z} ( 7 ) 

XeT l UT 2 

Koford and Groner [6] observed that the criterion D 2 (f, W) can be expre 
expressed in the form 

2 

D 2 (f , W) = E NjU^S, + {sJm. + c q + (-1)-*} 2 ], (8) 

j = l 

T T T 

where C = [£ , c q ] , m^ = E x/N^ , S^. = Z (x-nrK)(x-mj) /Nj 

xeT. xeT. 

J J 

and N. is the number of observations in T.. The values of £ and c that 
J J ° 

minimize D 2 (f, W) , given by (8), are as follows. 



As mentioned earlier, a nonlinear classifier W(x, c) ts obtained in the 
form of a linear combination 


W(x, c) = c T 4>(x) = CjQjU) + ... + c^Q ^(x) 


( 1 ) 


of linearly independent nonlinear continuous functions Qj (x) , ..., Q m (x) , 
whose functional forms and number are prechosen. The performance of the 
classifier W(x, c) , how often the classifier will be able to correctly 
classify an observation of known classification, depends primarily on the 
choice of these functions. But, often they are chosen arbitrarily on the 
basis of economic consideration and one's intuition. However, sometimes, 
when one has some prior knowledge of the population densities, they can be 
satisfactorily chosen. For example, when it is known that the populations 
have multivariate normal distributions with unequal covariance matrices, the 
function W(x, c) may be chosen as a second degree polynomial in x^, ...» 

x , the p components of the observation vector x, and the functions Q.. (x) 1 s 

P k l k 2 k D 

as the monomials x^ x^ . .. x , where k.'s are 0, 1 or 2 and 

k + k + ... + k = 2. It has been suggested [5] that when the densities 

1 2 p 

are unknown, the functions Q..(x)'s should be taken as orthogonal or ortho - 
normal functions such as generalized Hermite functions given by 


(n) 

k 1 ’ k 2 ’ ' ’ k p 


■n/2 


(x) = (-l) n (2 it) 


9x, ... 9x 

I P 


~r — exp (-% z x. ) . 

K r. 1 

i = l 


Unfortunately, there is no specific rule to guide us In selecting these 
functions judiciously. 
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4.1. Patterson — Womack Classifier 

The adaptive classifier of Patterson and Womack [8] is a nonlinear 
classifier W(x, c) obtained as an approximation to a function f(x) that 
separates the training sample sets and and takes the value C ( 2 | 1 ) on 
Tj and - C ( 1 j 2) on where C ( j | i ) denotes the cost of m i sc 1 ass i fy i ng an 
element from II. into IL . The parameter vector c is selected as a vector 
c that minimize the pay-off function 

M(N p N 2 ) = (q^N,) E |w(x, c) - C (2 | 1 ) [ 2 + 

xeT^ 

(q 2 /N 2 ) E |W(x, c) + C ( 1 1 2) j 2 , (14) 

xeT 2 

where N. is the number of elements in T. and the prior probability q. of 
the population H. is assumed to be known . As in (5) and (6), it is not 
difficult to see that M(Nj, N 2 ) is the sample average of [w(X, c) — f(X)| 
and therefore, by law of large number, M(Nj , N 2 ) converges to 

R(f, W) = E|W(X, c) - f(X) | 2 

as N| -»■ ® and Patterson and Womack [8] have shown that the vector 

c that minimize R(f, W) , as given above, will also minimize 

E | W ( X , c) - D { X) | 2 , 

where D(X) is the Bayes 1 classifier given by 



D(X) = { C ( 2 | 1 ) q , p | (X) — C( 1 | 2) q 2 p 2 (X) }/{q } p , (X) + q 2 P 2 (X)} (15) 


and p.(x) is the density of IT., provided that f(x) has been chosen in the 

above fashion. This has been the motivation behind using C ( 2 1 T ) and — 

C ( 1 1 2 ) as the preassigned values of the separating function f(x). 

When q.'s are unknown and to be estimated, then N./(N,+lO can be used 
^ I i i z 

as an unbiased estimate of q., where N. is the number of elements belonging 
to T. out of a sample of size (Nj+N 2 ) from the mixed population. Then the 
criterion M(Nj, N 2 ) may be replaced by 

\ (f, W ) « E | W(x, c) - f(x)i 2 /(N 1 +N 2 ), 
xeT UT 2 

the mean square error criterion. Patterson and Womack have called M(N^, N 2 ) 
the mean square error criterion. 

Example . Let T^ = {[0, 1]"*", [1, 0]\ [2, 1]"'"} and 

T 2 = {[0, 0] T , [2, 0] T , [0, -1] T }. 

Then, assuming W(x, c) to be of the form 

2 2 

W (x , c) = CjXj + c 2 x 2 + c^x^ + C^X j + c^x 2 + Cg (16) 

and C (1 1 2) * C (2 1 1 ) = 1 , we can obtain the parameters Cj , c 2 , ..., c^ by 
the least square method. Solving the normal equations, we obtain 

c, = -11/8, c 2 = 1 A, c 3 = 5/8, c 4 = 17 / 8 , c $ = 1 and c 6 = -1 /k 

and W(x, c) = (1/8) [2x 2 — 1 lx 2 + 5*^ + 17Xj + 8x 2 — 2] . 
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A decision rule may be defined in order to assign xeJI^ whenever W(x, c) > 0 
and to n 2 otherwise. We may note that such a decision rule correctly clas- 
sifies all the observations in Tj and T . 

In this example, the sets and T 2 can be separated by a quadric such 
as parabola, ellipse, hyperbola or circle. That is why a classifier assumed 
to be of the form (16) of a second degree polynomial in x could classify all 
the points of Tj and T 2 correctly. But there are cases when the sets Tj 
and T 2 cannot be separated by any such curve. Then a second degree poly- 
nomial in x will not be so efficient, that is, the classifier will not be 
able to classify so many points of Tj and T^» 

4.2. Potential Function Method 

The potential function method was introduced for adaptive classifier 
design by Aizerman, Braverman and Rozonoer [l]. They assumed that a function 
f (x) , f (x) > 0 on T and f(x) < 0 on that is, sign f(x) = 1 on Tj and 
sign f(x) = — 1 on' T 2> that can be used to separate T^ and T 2 can be 
expressed in the form 

f(x) = CjQ.j(x) + . . . + c m Qjx), (17) 

where {Qj (x) , ...» Q m (x)} > s a finite set of orthogonal or orthonormal 
functions. This function f(x), which can be used as a classifier, can be 
obtained iteratively using the following algorithm. 

f p+1 (x) = f n (x) + s(n+l) K (x, x(n+l ) ) , f Q (x) = 0, (18) 

where s(n+l) = sign f(x(n+I)) —sign f n (x(n+l)), f p (x) is the approxima- 
tion to f(x) obtained at the nth step of iteration, x(n) is the nth obser- 
vation in a sequence of observations of known classification from the 
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mixture of the populations JIj and and K(x, y) is a potential function 
of the form 

m 

K (x, y) = E Q.(x) Q.(y). (19) 

i = l 


In order to avoid the problem of choice of the set {Qj (x) , 

Q^(x)} Braverman [3] made the following suggestions. Let us assume that 
f(x) can be represented by 


f (x) = Z c. Q.. (x) , 
i = 1 


2 C 2 < 

I 



( 20 ) 


where {Q.(x)} is a complete set of square integrable functions. Then 
K(x, y) in the above algorithm (18) should be replaced by a function 
K(x r y) of the form 

oo 

K(x, y) = E d. (x) Q.(y). (21) 

i = l 


Braverman showed that if we take 


K(x, y) = F (d(x, y) ) , 


( 22 ) 


where F is a continuous real function, d(x, y) is any function defining 
distance between x and y and F has a positive Fourier transform everywhere, 
then K(x, y) can be represented in the form (21)* For example, K(x, y) 
may be chosen as 


K(x, y) = exp { - a Z 


~ Y i ! 1 * 




X. 

I 


(23) 



^here a is any finite real number. Braverman has shown that the sequence 
f (x) , where x is an observation from IT ^ or converges in mean to the 


function f(x) 


In the expression (23) for K(x, y) we choose 0=1 and illu- 


strate the construction of f (x) 


n x, x 2 


Member s(n) 


f (x) 
n 


- exp ( — Ex.) 

2 exp { * — x 2 — (x 2 -1) 2 } + f^(x) 


f £ (x) 

f 2 (x) 

f 2 (x) 

f 2 (x) + 2 exp { — (Xj-1) 2 - x 2 } 


4.3. Stochastic Approximation Method 

Yau and Schumpert [10] have used stochastic approximation method in 
obtaining an adaptive classifier W(x, c) , as in (l), as an approximation to 
a function f(x) that takes the value 1 on T, and -1 on T„ using G{c) given 
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a special case of ( 5 ), as a pay-off function. G(c) has a unique minimum, 
or, the equation G‘(c) = 0 has a unique solution. But the density q^Pj(x) + 
q 2^2 ( X ) * from the mixture of and being unknown, G(c) is unknown. 

The solution c of G'(c) = 0 can be obtained iteratively using the following 
algorithm. 

c(n+l) = c(n) — a n {c T (n) <f> (x(n) — f (x (n) ) } $ (x(n)), 

where x(n) is the nth observation from the mixture of 31^ and 11^ $"*"(x) = 

[Q (x) , ..., (^(x)] ancl ( a n ) ' s a sequence of real numbers satisfying the 
cond i t i ons 

O0 OO 

2 

(a) 2a = 00 and 2 a < 00 . 
v n n 

n=l n-1 

{1/n} is an example of one such sequence. The convergence of c(n) to c 
follows from the fact that c(n) forms a Robbins —Monroe process. 
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ABSTRACT 


Two recursive estimators for the current 'mean 1- of two 
stochastic processes used in modelling patterns varying over 
space and/or time have been obtained. A brief survey of models 
and estimators so far proposed in literature has also been made. 



1 . 


Introduction 


In the analysis of remotely sensed data, such as multispectral scan- 
ner (mss) data, it is usual practice to assume that the data generated 
come from populations having multivariate normal distribution. When the 
parameters of the distributions, namely, the means and the covariance 
matrices, are known, then assuming the misclassif ication costs to be the 
same for all populations the Bayes ' classifier, the classifier with mini- 
mum probability of misclassif ication, is given by 

W (x) = log p ± x) - log p (x> + log (q^/q,.) ,i^j ,i, 
where m is the number of populations and p_^(x) is the density and q. is the 
prior probability of the ith population. When the parameters are unknown 
the usual and useful practice in designing classifiers is to replace the 
unknown parameters in (1) by their respective estimates obtained from a 
training sample, a sample of known classification. These sample based 
classifiers, which we shall refer to as estimate plug-in classifiers, do 
not in general minimize the probability of misclassif ication. But they 
are useful in large number of applications and theoretically desirable 
since we know (Anderson [2]) that as the training sample sizes increase 
the estimates converge with probability one to the value of the unknown 
parameter, provided that the observations in a training sample are iden- 
tically distributed. Thus for large training sample sizes and each 
training sample having identically distributed observations a sample 
based classifier may be expected to perform satisfactorily. 

The data obtained by remote sensing devices in the earth resources 
survey come from large areas and over a long period of time. Often the 



statistical characteristics of the data have been found to undergo changes 
over space and time due to variation in spatial and temporal conditions - 
Thus observations in a training sample collected from a large area and/or 
over a long period of time are not identically distributed. Therefore, 
the estimate plug-in classifiers in which the estimates are based on such 
samples can no longer be expected to perform satisfactorily in classifying 
patterns that vary over space or time. The performance of these estimate 
plug-in classifiers can be improved by updating the estimates whenever 
necessary. 

The estimate updating methods are applicable to all populations in 
the same way. So* without loss of generality, we can discuss these 
methods in the context of any single population. 

The efficiency of the updated estimates depends to a great degree on 
the accuracy of the statistical model chosen to represent the statistical 
nature of the varying patterns. But the complexity of the statistical 
model may on the other hand lead to estimates that are not useful for 
practical purposes because of computational inconvenience. In this paper 
we shall consider two general models for estimation and some "quick" 
methods of estimation based on exponential smoothing techniques (Box and 
Jenkins [3], Brown [4]). 

Throughout the paper we shall denote a sequence of observation vec- 
tors from the p- dimensional population II by Y^, Y^, . Y^, ... and 

assume 

\ - \ + V <» 

where X^’s are the signals, true patterns or means of Y^'s and W^’s are 



IU, 

p - variate normal random noise vectors distributed independently and 
identically as N(0, C) . Henceforth we shall write L(*) to designate the 
distribution law of the random vector or vectors in the parenthesis. 

2 . Autoregressive Model 

The pattern can be assumed to be a member of a r-th order auto- 

regressive process. For the sake of simplicity and mathematical tract- 
ability we assume that is a member of a first order autoregressive 

vector process given by 

\ - a *k-i ■ \.i < 2) 

where A is a pxp matrix of real numbers, Z 1 s are independently and 
identically distributed as N (0, E), Z^'s are independent of * s and 
L(xp = N(0, V). The process is known (Box and Jenkins [3]) to be sta- 
tionary if | A | < 1, where J A [ denotes the determinant of A, and nonsta- 
tionary otherwise. 

We shall denote by L(P) , L(P, Q) and L(p[q) the probability distri- 
bution of P, P and Q jointly and conditional distribution of P given Q 
respectively. We know that if L(P, Q) = N(p, D) , L(P) = N(p^, D^) , L(Q) 
N(p 2 , D 22 ) and 


D - 

" D ll D 12 

> u = 



D 21 °22 


_ V 2 



then 


th-f 

L(P|Q) = N(Q, *), where 

^ = U 1 + D 12 °22 ^ ~ V t) ^ 

* = D 11 " h 12 D 22 D 21* 

Therefore, frora (1) and (2) we have 

\ = X^, L(Y 1 |X 1 ) = N (X x , C), L(Y 1 ) = N (0, V+C) . 
and hence L(Y^, X^) = N(0, Q) , where 

q = r v+c v ' . 

L v v _ 

From (3) we now have L(X^jY^) - N (u^, Q^) , where 

v ± = ECXjlY^ = VCV+C)"^ (4) 

Qj_ = Cov (X x , x 1 |Y 1 ) = c(v+c) -1 v. 

Similarly we have 

y 2 = x 2 +w 2 = ax 1 +z 1 +w 2 , 

so, L(Y 2 |Y 1 ) = N (Ay lS AQ^+Z+C) and L (X^Y^ = N (Ay ± , AQ^+Z) 

As in (4), we obtain from (3) 

y 2 = E(X 2 |Y 2 , Y x ) = C (AQ 1 A T +Z+C)“ 1 A ± 

+ (AQ 1 A T +E)(AQ 1 A T +2+C) _1 Y 2 

Q 2 = Cov (X 2 , X 2 |Y 2 , Y 1 ) = C (AQ^+Z+C) -1 (AQ^+Z) 


and 



thus 


“k - E <\l\ V - c (AQ k . 1 A W)' 1 A V , k . 1 +(AQ k _ 1 

+ (AQ k _ 1 A T +Z) (AQ k _ 1 A T +I+C)~ 1 Y k _ 1 

and Q k = Cov (X^ xjY , Y ) = C (AQ k _ 1 A T +Z+C) _1 (AQ k _ ;i A T +Z) (5) 

From the above model x*e can obtain the model of Abramson and 

Braverman [1] by taking A = a I^, where 1^ is the pxp identity matrix 

and a is a real number, and that of Scudder [12] by taking A = 1^. 

If we assume | A [ < 1, so that the process is stationary, then we obtain 

the model proposed by Tamura et al [14] . Following Abramson and Braverman, 

2 

if we assume for slowly varying patterns that = AC? £ - {X /(L-A)} 

0<A<1, in' order that = then we have from (5) 

•■k - ^ »k-i + x \ 

Writing u k = ECX^^, Yp , we obtain 

U k = E ^ X k+lI Y k J * * * * Y p 

- E< \ +Z kl\ V * v 

Therefore = (1-A) + AY^.* (6) 

The above recursive formula is reminiscent of the exponentially weighted 
averages to be discussed in section 4. 

. In the model (2) it is implied that a new pattern is encountered at 
each observation, in other words, Y^, are observations on random 


vectors each having different means* The efficiency of the estimator (5) 




will be poor if the changes in pattern are few and far between. In Sec- 
tion 6 we will come back to the question of how to improve the estimates 

( 5 ). 


3. Chernoff - Zacks Model 

For analyzing time varying patterns Chernoff and Zacks [6] proposed 
a model and obtained minimum variance linear unbiased (MVLU) estimate of 
the current mean. They also obtained Bayes 1 estimator and an n ad hoc" 
estimator, a simplified version of Bayes' estimator. But they are unus- 
able for practical purposes, Bayes' estimator due to computational diffi- 
culties and the "ad hoc" estimator due to its being based on many restrictive 
assumptions. Chernoff and Zacks dealt with univariate observations. There 


we have Y 1 , X., W. as univariate random 
k x l 

variables . 

The model assumes 

Y. - X. + W. , 1 = 1, • * * j 

1X1 

n > 

(7) 

x ± = x m + J i z i 5 1 = 1> •••» 

n-1 

/"“N 

CO 


where W^'s are independently and identically distributed as N(0, 1), 

J 1 s are random variables having ‘ the value 1 if there is a change 
x 

in the mean or pattern X^ between ith and (i-KL)th observation and the 
value 0 otherwise and Z ^ is a random variable representing the amount 
of change when a change takes place. 

Let us write 


- “ 

Y 


w. 


- “1 


J- 

1 

, W = 

1 

, Z = 

. 1 

and J = 

. 1 

Y 

- ^ 

nxl 

w 

- *1 

nxl 

z , 

L o n J 

nxl 

> . 


Y 

nxl 


( 9 ) 



m 


Then noting that we can combine (7) and (8) to obtain the equation 


n-1 

Y. - X + W, + Z J, Z, , l<i < n-1, 
x n x . . k k 7 — — 

k-x 


= X + W , i=n , 

n n 


we can write 


Y = X e + W + MZ, 
n n 


where 


M = 
nxn 


J 1 J 2 “ 


J - 0 
n-1 

J . 0 
n-1 

J 0 
n-1 


and e - 
n 

nxl 


L J 


If we assume 


P( J k =i) = p k , k=l, n-1. 


and 


L(Z) = N(0 , aS) 


where 


S - 


I i 0 
n-1 

0 0 


then we obtain from (10) that 


L(Y|X n , J) = N(X n e n> Q(J)) , 


where 


Q(J) = I + o 2 MM T . 
n 


( 10 ) 


( 11 ) 


( 12 ) 


(13) 


f 



w 


From (11) we have 


n-1 

EQ(J) = £ E[p k (Q(J)|J k =l} + (l-P k ){Q(J)|J k =0>3 
k=.l 

= ! n + I p k E <MM T iJ k =l) 
lt=l 
n-1 

= 1+1 p. R(n, k), 
n k=l k 


(14) 


where R(n, k) is a nxn matrix whose upper left submatrix is E k , a 
kxk matrix all of whose elements are 1, and the other elements are zero. 

When changes in the mean occur almost always or when there are al- 
most no changes, the random vector J can be assumed to be distributed 
independently of Y. In that case we obtain from (13) and (14) 


where 


L(Y X ) = N(X e , V ) , 

1 n n n n 

n-1- 

V = EQ(J) = I + £ o p, R(n, k) 
n n - - K 

K~1 


(15) 

(16) 


If u is the current mean, the mean of Y , then using standard argu- 
n n 

merits it can be shown from (15) that 


T -1 v ; T -1 
p - e V Y/eV e 
n n n n n n 


(17) 


is a MVLU estimator of p . If v(n, j) denotes the sum of the elements 

n 

-1 

of the ith column of V , then p can be written in the form 
J n n 


n-1 


n-1 


n 


[£ v(n, j )Yj + Y ]/[£ v(n, j) + 1] . 

3-1 3-1 


(18) 



( 16 ) 




When we have n+1 observations, we can obtain in a way similar to 


n+1 


J „ +1 + jj R <" +1 ' « 

k-1 


T 

t*;°i 



rv 

0" 

+ a p 

"e “ 


n 


r n 

n 


_0 

1_ 


0- 


We know (Rao [11] p. 29) that if 

E = A + U V T , 
pxp pxp pxl lxp 


“1 “1 -1 T -1 T -1 

then B « A - A UV A /(I + V A U) , 


(19) 


Therefore , 


V 


-1 

n+1 


where K 


n+1 


”v 1 o 1 
11 


0 1 

l+a 2 p 

fv(n, 1)1 [v (n 


a 2 P 


n 


K 


v(n, n) 
0 


T v ,-l "n+1’ 
e 

n n n n 


[v(n, 1) ... v(n, n) 0] 


(20) 


So , 


v(n+l, n+1) = 1 


2 n 

v(n+l, i) = v(n, i) / [1 + a p E v(n, k) ] , l<i<n, 

n k=l 

n ^ n « n 

and 1 + I v(n+l, i) = [l+(l+a p ) E v(n.i)]/[l + a p E v(n, k) ] 
i-1 n i=l 1 n k»l 


n 


n 


J n+1 " I £ , v(n+1 - k > Y k + Y n+l 1/[1 + f /<" +l > k) l 
k=l k=l 


Therefore , 



/0 


n 


n 


[Z v(n, i)] y + [1 + cr p E v(n, i)] Y 
i-1 " n jol ^ 

2 n 

1 + (1 + a p ) X v(n, i) 
i-1 


(1 - A ) u + A Y , 
n n n ti-rl 


( 21 ) 


2 n 2 n 

where A = {1 + a p Z v(n, i) } /{ 1 + (1+a p ) Z v(n,i)}. 

Ji i=l 1=1 


( 22 ) 


Using (22) we can now recursively estimate the current mean. 


4. Exponentially Weighted Moving Average 

In analyzing observations on space and time varying patterns, espe- 
cially time series data, it has been a popular practice (see Brown [4]) 
to use exponentially weighted moving average (EWMA) (Box and Jenkins [3]) 
as the current "mean", location or "level" of the process because of its 
success in a variety of applications. Let °^ servat ^ ons 

at *. t-1, t (a stochastic process in discrete time). Then an exponen- 
tially weighted moving average (EWMA) is a predictor or forecast of a 
future level at t + h derived by weighting past observations "exponen- 
tially" (or geometrically) and expressed in the form 

CO 

Y (t, h; A) = X E (1-A) r Y , |xf <1 (23) 

r=0 

= A Y t +(1-A) Y (t-1, h; A), [A:). <1. (24) 

For h=l, we shall write 

Y(t, 1; A) = Y_(A). 


(25) 



Thus from (24) and (25) we have 


Y t (X) = XY t + (1-A) Y t _ 1 (A) (26) 

= AZ (1-A) r Y 
r=0 t_r 

Muth [10] has shown that Y (A) , as given by (26), is optimal for some 
A for the following nonstationary moving average process. Let 


Y 

t 


W 


X 


t 



(27) 


2 

where W^'s are independently distributed as N(0, a ) and s are 

2 - 

independently distributed as N(0, t )» Then the predictor Y^ (X) 
given by (26) minimizes the error variance 


E <Vr V x »' 


when A is given by 



t 2 1/2 
(1 + -^) 


4o 


(28) 


(29) 


From (28) it also follows that 


Y t (A) = E(X t+1 |Y t , Y fc _ 1 , ...)■ 


(30) 


When Y is a pxl vector, L(W fc ) = N(0, C) , L(Z fc ) = N(0, Z), 
then following Muth, it can be shown that Y t (A) minimizes the scatter of 
error given by 



provided we take 

A = 1 + (|l|/2]ci) - J £ | / i C { a/i + | Z ) /4 j C | . (31) 

In passing, it should be noted that the choice of the weight A is 
vital in the definition of EWMA. In many situations EWMA's may not be 
optimal, because the underlying process may not be of the form (27), or 
even if the process is of the form (27) , the matrices Z and C are 
unknown, so that it may be impossible to obtain the exact value of A 
given by (29) or (31) . Even then they are attractive because they are 
easy to compute. 

Motivated by the work of Abramson and Braverman [1] , especially by 
the equation (6) which has been derived from a model of the form (27) and 
perhaps following the popular practice, Kriegler and Horwitz [8] have 
proposed that for simultaneously updating the current mean of several 
populations whose covariance matrices do not change any one of the follow- 
ing three algorithms should be adopted. 

(1) Exponentially Weighted Running Estimates 

If an observation A has been recognized as one coming from the 
class IL, then the present mean of this class is to be updated to 

a new value where 

M' = (1-A) M i + AY, (32) 


and A is a constant. 





Kriegler and Horwitz have recommended that X should be given a 
value between 0.001 and 0.003. Evidently such small X will heavily 
weigh the past observations, especially the present mean and will 

attach very little weight to the current observation. Unless the pat- 
terns are extremely slowly varying, such small X will make the updated 
mean MY a very poor representative of the current pattern. 

(2) Exponentially Weighted Running Estimates With Interaction 
Assuming that at any time the mean M of the class JL is related 
to the sum M of means of all M classes by the equation 

M = mp ± M i , (33) 


where 

p i 

is constant for 

all - ' time, the present mean M_. of any 

clsss 

n. 

j 

should be updated 

to the value Mf , where 

j 


M j = P i M i /p j ( 3/ 0 

and is the current updated mean of IK, updated according to (32). 

The assumption (33) is hard to justify. 

(3) Posterior Probability Weighted Estimates 

Let IK denote the fixed covariance matrix of IK, Then according 
to this algorithm an observation Y^_ is assigned to the class IK if 
for some preassigned threshold value t^, 

W >L j ( V and W < V 

W = I K i I ex P {~V Y t )/2} 


where 



and 


Q.(Y) = (Y-M.) T K" 1 (Y t -M.) . 

The mean should be updated to M' given by 

= 1 + yR ± (Y ) (Y -M ) 

= U - A(t, i)} M i + A(t, i) Y t , 
where R (Y ) = L (Y )/ L (Y ) 

j=l J 



(35) 


and A(t, i) = yR^Y^), y being a constant (0<y<l) . 

The updating formula (21) based on Chernoff ~ Zaclcs model and formula 
(35) , both resemble EWMA, except that the weights are now variable and to 
be updated each time* The predictor (21) has been shown to be MVLU. But 
the predictor given by (35) may not enjoy an}^ such property. It is 

not known for what kind of process will be optimal in some sense* 

Kriegler and Horwitz have suggested some values of and y on 

empirical grounds, but have failed to give a specific rule for the selec- 
tion of these quantities. The performances of the predictors (35) in 
different situations are yet to be determined and compared. 

The work of Kriegler and Horwitz [8] motivated Chang [5] to propose 
updating formulae for the mean and covariance matrix using variable 

weights* Let Y,_, *.*, Y. / • \ denote the observation from the ith 
jl jn(j) J 

training field, a set of identically distributed random vectors. Then 
the updated mean Mf is given by 


MC “ Y - 
J J 



(l-Yj) 



(36) 



where 


M J ' < I 1 l + - + Vu> ,/na)> 


Y. = a. - N. -/ N . 5 
3 J“1 J“1 J 


- n(l) + . » . + n(k) 


and is a number satisfying 0 £ a j_i — an( ^ t0 S uesse< 3 

from the nature of the data. The covariance matrix is to be updated to 


K. where 
J 


k. • {n . / (n ,- i) } (q: - m: m. ), 

J J J J J J 


(37) 


and 


“i ‘ h-i + V 

n(j) T 

Q, = L Y Y,,/ n(j)- 


i=l 


J3- J1 


When the patterns are varying, the updated "mean." M( given by (36) is 
not an unbiased estimate of the current mean. It is not known in what 
way these updated estimates are optimal. The other disadvantage of this 
algorithm is that it leaves the choice of y to the gues:: of the user* 

5 . Updating By Kalman Filters 

Consider the dynamic relationship 


\ - W + \ 


\ = AS k-i + z k-i 


(38) 


satisfied by the pxl observation vectors Y^, , ... and the 

qxl state vectors S^, ..., S^, ..., where W^'s are pxl random 




noise vectors independently distributed as N(0» C), Z k 's are qxl 
random vectors distributed independently of each other and of w k ' s as 
N(0, Q) and the pxq matrix and qxq matrix A may be time vary 

ing, but are known matrices of real numbers. It is well known (Lee [9]) 

A 

that an estimate of given Y^, . such that 


E(S^_) - E(S k * • m Yj) 


and E[(S k "S k ) (S fc — S fc ) ] is minimum 


(39) 


can be obtained, using discrete Kalman filter, recursively as 

5 k - “k-1 + p k ^ c ‘ 1[Y k - VVi 1 - <40> 

P k = [(A P k _ 1 A T + Q)" 1 + li^C -1 (41) 


- Cov (S k> S fc [Y fc , Y x ) 


Assuming covariance matrices of all populations to be the same and 

not undergoing any change. Crane [7] used the estimation method by Kalman 

filters for simultaneously updating the means of the populations 

]l , .... IT . Crane considered the following model: 

1 m 


“ Vk + V 

(42) 

= s k-l + z k-l’ 

(43) 


“k-'i ®V 


(44) 




where is a mxl matrix with ith row (1=1, . . - , m) as 1 if Y^ 

has been recognized to be coming from II ^ and zero otherwise and A <$D B 
denoting the Kronecker product of the matrices A and B (Anderson [2]). 
It has also been assumed that 


Q = 


R C, 

1 r. 


where 


R 

mxm 




r 2 1 




(45) 

(46) 


The state vector is a m pxl vector of m stacked subvectors of 

pxl vectors of present means (prior to updating) . Let ^ and Z^^ 

denote the ith pxl vector component of and Z ^ respectively. 

Then assuming to be an observation from IU, we obtain from (42), 

(43) and (44) that 


- *<*> 
’ \ 


+ w. 


(47) 


4 i} + 4 - 1 + 4-1 


(48) 


This is the process for which EWMA s are optimal. Thus it is evident 
that Crane’s model is a generalization of the above model so that in 
simultaneously updating the means of all populations their interaction 


has been taken into account. 



Noting that 


tv 


* < P k-l + « ^ < P k-l + « H k + c > 


we now obtain from (40) that 


K ■ K-i + < p k-i + « < ' H k < p k-i + ® H 

Further assuming L(S^) = N(0> P ) and 

P = T © C , 
o o ? 

mpxmp mxm pxp 


T 

k 


+ C]" 1 (Y k -H k S k _ 1 ). (49) 


(50) 


we can obtain from (45) that 

Pl+O = (P^ + 11 P C’ 1 H^- 1 + Q 


= (T 1 + M-M?) 1 ® C + R ® C 

O 1 1 


T M M T 

= (T - -2-^ -} & C + R ® C 

0 i+m:t m, 

1 O 1 


= T i ® c 

P 2 +Q = [(P^Q)" 1 + C _1 H 2 ] _1 + R © C 


and 



/ fol 


= (T 1 1 + M 2 M 2> 1 © c + R © c 


where 


and 


T 

T M M T 
12 2 1 


T, = 


T rt = 


T - 
o 


T i " 


T 


1+flL 

l 

1 z 

© c. 


T 

T M-M- 

T 

oil 

o 

T 


1+M 1 T 

M, 

1 o 

1 


T- 

12 2 



} © C + R © C 


1+M 2 T 1 M 2 


By induction, we can obtain 

\ + Q = T k © c , 

mpxmp mpxrap mxm pxp 

where ‘ Tk = T, - + R (sl) 


Using (51) we can simplify (49) and obtain 


K - K-i + « i+f £ \-i v ' 1 Vi \ 




V 


( Wk-i>- <52) 


Crane has suggested some n ad hoc 11 value for and r^ in (46) 

in order to define the interaction matrix R. But it seems more reason- 
able to estimate them from a part of the sample. 



The above estimation procedure will be clear from the following 


numerical example. 


Example . Let m = 2, p = 2, T q “ C = I ^ - 0.01, - 0.02, 

= [1 2] e n v Y T 2 = [-1 1] e n 2 . Then 

M* = [1 0], = [0 1], 

mT T M-. = 1, 

1 o 1 

T- = I - {I pi [1 0] I }/ 2 + 0.01 Q 0.02] 

L°J jO.02 1 J 


1 t 

O H 

<Q 

ro .5 o'] r o.oi 0.002] 
L 0 oj i 0.002 0.01 J 

r o. 5 i 

0.002] 

L 0.002 

1.01 J 5 

1.01, 



T 2 = T 1 " fT l "°" [ ° 1] T l }/ 2,01 + R 


T 1 

-L O. 


"0. 

0.002" 

"0.51' 

0 . 002 " 

0 

1.01 

0.002 

1.01 


- 

L 



+ R. 


6 . Detection of Change in Mean 

The models on which we have based our estimation process so far 
implicitly assume that the means change almost at each observation. If 
this is the case, then the updating of means using the procedures so far 
discussed is just. But when we are considering spatial variation and 



have observations from points spaced not too far from each other, then it 

is more likely that the sequence of observations Y^, . Y^ can be 

grouped as {Y^ 3 Y^ , * * * } Y^ ^ { ^k. +1 9 ^k ~i~2 9 * ' * » Y^ } > • • • such that 

members in the same group have the same mean while members in different 

groups have different means. In that case it will be more reasonable to 

use the arithmetic mean (Y. ,- + ...+ Y T )/(k.,- - k.) as the mean 

k . +1 k . . , l+l l 

i l+l 

of Yj (1<_+1 <_ j k^ + ^) than to use the mean obtained by the updating 
process. But often the points K , k^ , ... at which the changes in mean 
take place is unknown. Therefore we have to use some statistical test 
in order to determine the points of change. 

When Y 2 5 Y n are P x ^ fandom vectors • independently dis- 

tributed normally with the same covariance matrix, say I , then for test- 

P 

ing the hypothesis 

H : p- = y* = ♦ . . = y (equality of means) 
o 1 2 n 


against : y^^ = . . . = y ^ U r+1 = . . . = y , 

where the change point r is unknown and 1 < r < N-1, Sen and Srivastava 

[13] have used the test statistic U* given by 

_ 9 n-1 n-1 _ n-1 

n* “ N E ( Z (Y - Y» ( S (Y - Y) ), 
i-1 J j = i J 

where Y = (Y^ + ... + Y^)/N. They have also given the asymptotic 

percentile of U* for p~2 through p=8. Chernoff and Zacks [6] have 
also given a test for p=l. For other references of work in this connec- 


tion we refer to [13]. 
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Addendum to Papers 1, 4 and 5 


Consider the matrix P for the two crop situation. Say, 


P 


p(i|i) 

P(2 | 1) 


P(l|2) 

P(2j2)_ 


A A 

where P(l|l) = x/i^ P(2 | 2> = y/N^ and 


A A A A 

P(2|l) = 1 - P(l|l), P Cl 1 2) = 1 - P(2 1 2) . x is the number correctly 
classified into population one and y is the number correctly classified 
into population two. Since X and Y are independently distributed as bi- 

A 

nomial variates, the probability that P is singular is positive. This is 
illustrated for the case 535 3 where represents the probability of 

correctly classifying an observation in population i. 

A 

Since P is singular iff xy - (3-x) (3~y) = 0, iff x+y = 3, there are 4 

A 

points yielding P singular. They are (1,2), (2,1), (0,3), (3,0) and the 

~ 2? 223333 

probability P is singular is 9 + 9 p l q l p 2 q 2 + q l P 2 + p l q 2‘ In 

-A 

general, if there are N^+l points which yield P singular. 

Hence, an alternate method must be used to estimate P. One approach 
is to estimate vu , for each population from training samples and then 

A A. 

Pj(xjp.,E.) dx, where is the region 
i 

corresponding to an observation being classified into the ith population, 
and p_, is the (continuous) density function of the jth population. The 
probability that P is non-singular with probability one can be established 
using advanced probabilistic arguments. 

A 

Another approach is to use the pseudoinverse of P rather than the inverse. 


estimate P(i|j) by P(ijj) = / 

R 


However, this approach would affect all the analyses done and the results may 
be difficult to interpret. This approach is probably the best but needs to 
be studied to assure that subtleties have not been ignored. 



Finally, one could make the analysis conditional on the fact that P is 

A 

A 

non-singular. Formulas for variances and mean squared error of P would 
necessarily be modified to incorporate this condition; yet the analysis 
would conceptually be straightforward and follow the arguments now presented. 



